Machine Learning

Table of Contents

Introduction

Machine Learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules. There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based, and so on. Common problems we encounter in machine learning include classification, regression, clustering, and anomaly detection.

In a machine learning project, we start with framing the problem. This is important because it will determine how we approach the rest of the project: which algorithms to consider, which performance measure to use to evaluate the model, and how much effort to spend tuning it.

We now quickly review some prerequisites from probability and statistics. The following sources were used to prepare this note:

Probability

A function $p$ that assigns a real number $p(A)$ to each event $A$ is a probability distribution or a probability measure if it satisfies the following three axioms:

1. $p(A) \ge 0$ for every event $A$;
2. $p(\Omega) = 1$;
3. if $A_1, A_2, \dots$ are disjoint, then

$$p\Big(\bigcup_{i=1}^\infty A_i\Big) = \sum_{i=1}^\infty p(A_i)$$

If $\Omega$ is finite and each outcome is equally likely, then $p(A) = \frac{|A|}{|\Omega|}$, which is called the uniform probability distribution. To compute probabilities, we then need only count the number of points in an event $A$. Generally, it is not feasible to assign probabilities to all subsets of a sample space $\Omega$. Instead, one restricts attention to a set of events called a σ-algebra or a σ-field, which is a class $\mathcal A$ that satisfies:

1. $\emptyset \in \mathcal A$;
2. if $A \in \mathcal A$, then $A^c \in \mathcal A$;
3. if $A_1, A_2, \dots \in \mathcal A$, then $\bigcup_{i=1}^\infty A_i \in \mathcal A$.

The sets in $\mathcal A$ are said to be measurable. We call $(\Omega, \mathcal A)$ a measurable space. If $p$ is a probability measure defined on $\mathcal A$, then $(\Omega, \mathcal A, p)$ is called a probability space. When $\Omega$ is the real line, we take $\mathcal A$ to be the smallest σ-field that contains all the open subsets, which is called the Borel σ-field.

Independent Events

If we flip a fair coin twice, then the probability of two heads is $\frac{1}{2} \times \frac{1}{2}$. We multiply the probabilities because we regard the two tosses as independent. Two events $A$ and $B$ are independent if $p(AB) = p(A)\,p(B)$. A set of events $\{A_i : i \in I\}$ is independent if

$$p\Big(\bigcap_{i \in J} A_i\Big) = \prod_{i \in J} p(A_i)$$

for every finite subset $J$ of $I$.

For example, in tossing a fair die, let $A = \{2, 4, 6\}$ and let $B = \{1, 2, 3, 4\}$. Then $AB = \{2, 4\}$, $p(AB) = 2/6 = p(A)\,p(B) = (1/2) \times (2/3)$, and so $A$ and $B$ are independent. Suppose that $A$ and $B$ are disjoint events, each with positive probability. Can they be independent? No. This follows since $p(A)\,p(B) > 0$ yet $p(AB) = p(\emptyset) = 0$. Except in this special case, there is no way to judge independence by looking at the sets in a Venn diagram.
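The die example can be verified by direct enumeration over the sample space; a minimal sketch (event sets taken from the example above):

```python
from fractions import Fraction

# Sample space for one roll of a fair die, with the uniform distribution
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}       # even outcomes
B = {1, 2, 3, 4}    # outcomes at most four

def prob(event):
    # Uniform probability: p(E) = |E| / |omega|
    return Fraction(len(event), len(omega))

p_A, p_B, p_AB = prob(A), prob(B), prob(A & B)
print(p_A, p_B, p_AB)   # 1/2 2/3 1/3, so p(AB) = p(A)p(B)
```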

Conditional Probability

If $p(B) > 0$ then the conditional probability of $A$ given $B$ is defined by

$$p(A \mid B) = \frac{p(AB)}{p(B)}$$

As a consequence of this definition, $A$ and $B$ are independent events if and only if

$$p(A \mid B) = p(A).$$

Also, for any pair of events $A$ and $B$,

$$p(AB) = p(A \mid B)\, p(B) = p(B \mid A)\, p(A).$$

For any fixed $B$ such that $p(B) > 0$, $p(\cdot \mid B)$ is a probability (i.e., it satisfies the three axioms of probability). In particular,

$$p\Big(\bigcup_{i=1}^\infty A_i \mid B\Big) = \sum_{i=1}^\infty p(A_i \mid B)$$

But it is in general not true that $p(A \mid B \cup C) = p(A \mid B) + p(A \mid C)$. The rules of probability apply to events on the left of the bar.

Random Variable

A random variable is a mapping $X : \Omega \to \mathbb R$ that assigns a real number $X(\omega)$ to each outcome $\omega$.

Distribution Functions and Probability Functions

Given a random variable $X$, we define the cumulative distribution function (or distribution function) as follows:

The cumulative distribution function, or cdf, is the function $F_X : \mathbb R \to [0, 1]$ defined by $F_X(x) = p(X \le x)$. It can be shown that if $X$ has cdf $F$, $Y$ has cdf $G$, and $F(x) = G(x)$ for all $x$, then $p(X \in A) = p(Y \in A)$ for all $A$.

We define the probability function or probability mass function for a discrete $X$ (one that takes countably many values) by $f_X(x) = p(X = x)$. Thus, $f_X(x) \ge 0$ for all $x \in \mathbb R$ and $\sum_i f_X(x_i) = 1$. The cdf of $X$ is related to $f_X$ by

$$F_X(x) = p(X \le x) = \sum_{x_i \le x} f_X(x_i).$$

For a continuous random variable $X$, $f_X$ is called the probability density function (pdf) if $f_X(x) \ge 0$ for all $x$ and

$$\int_{-\infty}^\infty f_X(x)\, dx = 1$$

and for every $a \le b$,

$$p(a < X < b) = \int_a^b f_X(x)\, dx$$

Also, $f_X(x) = F'_X(x)$ at all points $x$ at which $F_X$ is differentiable. Note that if $X$ is continuous, then $p(X = x) = 0$ for every $x$. We get probabilities from a pdf by integrating. A pdf can be bigger than 1 (unlike a mass function); in fact, a pdf can be unbounded. We call $F^{-1}(1/4)$ the first quartile, $F^{-1}(1/2)$ the median (or second quartile), and $F^{-1}(3/4)$ the third quartile.
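As a quick sketch of these definitions, the quartiles of the Exponential(1) distribution can be read off its inverse cdf $F^{-1}(q) = -\ln(1-q)$, and the Uniform$(0, 1/2)$ density illustrates that a pdf may exceed 1:

```python
import math

# Quartiles of Exponential(1): F(x) = 1 - exp(-x), so F^{-1}(q) = -ln(1 - q)
def exp_quantile(q):
    return -math.log(1.0 - q)

first_quartile = exp_quantile(0.25)
median = exp_quantile(0.5)        # equals ln 2
third_quartile = exp_quantile(0.75)
print(round(median, 4))           # 0.6931

# A pdf can exceed 1: Uniform(0, 1/2) has density 2 on its support,
# yet the total area is still 2 * (1/2) = 1
density, width = 2.0, 0.5
print(density * width)            # 1.0
```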

Some Important Discrete Random Variables

Note that $X$ is a random variable in all the cases; $x$ denotes a particular value of the random variable; and $n$, $p$, and $\lambda$ are parameters, that is, fixed real numbers. Parameters such as $p$ and $\lambda$ are usually unknown and must be estimated from data; that’s what statistical inference is all about.

Some Important Continuous Random Variables

Bivariate Distributions

Given a pair of discrete random variables $X$ and $Y$, define the joint mass function by $f(x, y) = p(X = x, Y = y)$. We write $f$ as $f_{X,Y}$ when we want to be more explicit. In the continuous case, we call a function $f(x, y)$ a pdf for the random variables $(X, Y)$ if $f(x, y) \ge 0$ for all $(x, y)$, $\int\!\int f(x, y)\, dx\, dy = 1$, and, for any set $A \subset \mathbb R \times \mathbb R$, $p\big((X, Y) \in A\big) = \int\!\int_A f(x, y)\, dx\, dy$.

Marginal Distributions

If $X$ and $Y$ have a joint distribution with mass function $f_{X,Y}$, then the marginal mass function for $X$ is defined by

$$f_X(x) = p(X = x) = \sum_y p(X = x, Y = y) = \sum_y f(x, y)$$

It is similar for $Y$. For continuous random variables, the marginal densities are

$$f_X(x) = \int f(x, y)\, dy, \quad \text{and} \quad f_Y(y) = \int f(x, y)\, dx.$$

The corresponding marginal distribution functions are denoted by $F_X$ and $F_Y$.

Independent Random Variables

Two random variables $X$ and $Y$ are independent if, for every $A$ and $B$,

$$p(X \in A, Y \in B) = p(X \in A)\, p(Y \in B)$$

Otherwise we say that $X$ and $Y$ are dependent. Suppose that the range of $X$ and $Y$ is a (possibly infinite) rectangle. If $f(x, y) = g(x)h(y)$ for some functions $g$ and $h$ (not necessarily probability density functions), then $X$ and $Y$ are independent.

Conditional Distributions

If $X$ and $Y$ are discrete, then we can compute the conditional distribution of $X$ given that we have observed $Y = y$. Specifically,

$$p(X = x \mid Y = y) = \frac{p(X = x, Y = y)}{p(Y = y)}.$$

This leads us to define the conditional probability mass function as $f_{X \mid Y}(x \mid y) = f_{X,Y}(x, y)/f_Y(y)$ whenever $f_Y(y) > 0$. For continuous random variables, the conditional probability density function is defined analogously:

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

assuming that $f_Y(y) > 0$. Then,

$$p(X \in A \mid Y = y) = \int_A f_{X \mid Y}(x \mid y)\, dx.$$

We are treading in deep water here. When we compute $p(X \in A \mid Y = y)$ in the continuous case, we are conditioning on the event $\{Y = y\}$, which has probability 0. We avoid this problem by defining things in terms of the pdf. The fact that this leads to a well-defined theory is proved in more advanced courses; here, we simply take it as a definition.

Bayesian Probabilities

Bayes’ theorem is used to convert a prior probability $p(\bm w)$ into a posterior probability $p(\bm w \mid \mathcal{D})$ by incorporating the evidence $p(\mathcal{D} \mid \bm w)$ provided by the observed data. We capture our assumptions about $\bm w$, before observing the data, in the form of a prior probability distribution $p(\bm w)$. The effect of the observed data $\mathcal{D} = \{t_1, \dots, t_N\}$ is expressed through the conditional probability $p(\mathcal{D} \mid \bm w)$. Bayes’ theorem, which takes the form

$$p(\bm w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \bm w)\, p(\bm w)}{p(\mathcal{D})}$$

then allows us to evaluate the uncertainty in $\bm w$ after we have observed $\mathcal{D}$, in the form of the posterior probability $p(\bm w \mid \mathcal{D})$. The quantity $p(\mathcal{D} \mid \bm w)$ on the right-hand side of Bayes’ theorem is evaluated for the observed dataset $\mathcal{D}$ and can be viewed as a function of the parameter vector $\bm w$, in which case it is called the likelihood function. It expresses how probable the observed dataset is for different settings of the parameter vector $\bm w$. Note that the likelihood is not a probability distribution over $\bm w$, and its integral with respect to $\bm w$ does not (necessarily) equal 1.

Given this definition of likelihood, we can state Bayes’ theorem in words: posterior $\propto$ likelihood $\times$ prior, where all of these quantities are viewed as functions of $\bm w$. The denominator in the equation above is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to 1. Indeed, integrating both sides of that equation with respect to $\bm w$, we can express the denominator in Bayes’ theorem in terms of the prior distribution and the likelihood function:

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \bm w)\, p(\bm w)\, d\bm w.$$

In both the Bayesian and frequentist paradigms, the likelihood function $p(\mathcal{D} \mid \bm w)$ plays a central role. However, the manner in which it is used is fundamentally different in the two approaches: in the frequentist setting, $\bm w$ is considered to be a fixed parameter whose value is determined by some estimator, with error bars on the estimate obtained by considering the distribution of possible datasets; from the Bayesian viewpoint, there is only the single, actually observed dataset $\mathcal{D}$, and the uncertainty in the parameters is expressed through a probability distribution over $\bm w$.

A widely used frequentist estimator is maximum likelihood, in which $\bm w$ is set to the value that maximizes the likelihood function $p(\mathcal{D} \mid \bm w)$. This corresponds to choosing the value of $\bm w$ for which the probability of the observed dataset is maximized. In the machine learning literature, the negative log of the likelihood function is called an error function. Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error.

One approach to determining frequentist error bars is the bootstrap (Efron, 1979; Hastie et al., 2001), in which multiple datasets are created as follows. Suppose our original dataset consists of $N$ data points $X = \{x_1, \dots, x_N\}$. We can create a new dataset $X_B$ by drawing $N$ points at random from $X$, with replacement, so that some points in $X$ may be replicated in $X_B$, whereas other points in $X$ may be absent from $X_B$. This process can be repeated $L$ times to generate $L$ datasets, each of size $N$ and each obtained by sampling from the original dataset $X$. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap datasets.
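A minimal sketch of this procedure, using the sample mean as the statistic of interest and hypothetical data values:

```python
import random

random.seed(0)

# Hypothetical original dataset of N points; the statistic is the mean
X = [2.1, 3.4, 1.8, 5.0, 2.9, 4.2, 3.3, 2.4, 4.8, 3.1]
N, L = len(X), 1000

def mean(xs):
    return sum(xs) / len(xs)

# L bootstrap datasets, each of size N, drawn from X with replacement
boot_means = [mean(random.choices(X, k=N)) for _ in range(L)]

# Variability of the statistic across the bootstrap datasets
center = mean(boot_means)
boot_se = (sum((b - center) ** 2 for b in boot_means) / (L - 1)) ** 0.5
print(round(boot_se, 3))
```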

One advantage of the Bayesian viewpoint is that the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion.
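A small sketch of this comparison, assuming a hypothetical Beta(2, 2) prior (Beta is conjugate to the Bernoulli likelihood, so the posterior has the closed form Beta(a + heads, b + tails)):

```python
# Three tosses, three heads: the maximum likelihood estimate is 1
heads, tosses = 3, 3
p_ml = heads / tosses

# Hypothetical Beta(2, 2) prior expressing a mild belief in fairness.
# Conjugacy gives the posterior Beta(a + heads, b + tails) directly.
a, b = 2, 2
post_a = a + heads
post_b = b + (tosses - heads)
p_bayes = post_a / (post_a + post_b)   # posterior mean

print(p_ml, round(p_bayes, 3))   # 1.0 0.714 -- far less extreme
```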

Gaussian Distribution

It is convenient to introduce here one of the most important probability distributions for continuous variables, called the normal or Gaussian distribution. For the case of a single real-valued variable $x$, the Gaussian distribution is defined by

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Bigg(-\frac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^2\Bigg)$$

which is governed by two parameters: $\mu$, called the mean, and $\sigma^2$, called the variance. The square root of the variance, $\sigma$, is called the standard deviation, and the reciprocal of the variance, written as $\beta = 1/\sigma^2$, is called the precision. There is also a Gaussian distribution defined over a $D$-dimensional vector $\bm x$ of continuous variables, which is given by

$$\mathcal N(\bm x \mid \bm \mu, \bm \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\bm\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(\bm x - \bm \mu)^T \bm\Sigma^{-1} (\bm x - \bm \mu)\Big)$$

where the $D$-dimensional vector $\bm \mu$ is called the mean, the $D \times D$ matrix $\bm\Sigma$ is called the covariance, and $|\bm\Sigma|$ denotes the determinant of $\bm\Sigma$. For $N$ iid observations $x_1, \dots, x_N$ drawn from the univariate Gaussian, the log likelihood function is:

$$\ln p(\bm x \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^N (x_n - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi).$$

The maximum likelihood solution with respect to $\mu$ is given by:

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n$$

which is the sample mean, i.e., the mean of the observed values $\{x_n\}$. Similarly, maximizing the likelihood with respect to $\sigma^2$, we obtain the maximum likelihood solution for the variance in the form

$$\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^N (x_n - \mu_{ML})^2$$

which is the sample variance measured with respect to the sample mean $\mu_{ML}$. Note that the maximum likelihood solutions $\mu_{ML}$ and $\sigma^2_{ML}$ are functions of the dataset values $x_1, \dots, x_N$. Consider the expectations of these quantities with respect to the dataset values, which themselves come from a Gaussian distribution with parameters $\mu$ and $\sigma^2$. It is straightforward to show that

$$\begin{aligned} \mathbb E[\mu_{ML}] &= \mu \\ \mathbb E[\sigma^2_{ML}] &= \frac{N-1}{N}\sigma^2 \end{aligned}$$

so that on average the maximum likelihood estimate will obtain the correct mean but will underestimate the true variance by a factor $(N-1)/N$. It follows that the following estimate for the variance parameter is unbiased:

$$\tilde\sigma^2 = \frac{N}{N-1}\sigma^2_{ML} = \frac{1}{N-1}\sum_{n=1}^N (x_n - \mu_{ML})^2$$

Figure: The green curve shows the true Gaussian distribution from which data are generated, and the three red curves show the Gaussian distributions obtained by fitting to three datasets, each consisting of two data points shown in blue, using the maximum likelihood results. Averaged across the three datasets, the mean is correct, but the variance is systematically under-estimated because it is measured relative to the sample mean and not relative to the true mean.
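The underestimation of the variance can be checked by simulation; a sketch with $N = 2$ points per dataset, as in the figure:

```python
import random

random.seed(1)
mu, sigma = 0.0, 1.0      # true parameters of the generating Gaussian
N, trials = 2, 20000      # two data points per dataset, many datasets

ml_vars = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(N)]
    m = sum(xs) / N                                    # mu_ML
    ml_vars.append(sum((x - m) ** 2 for x in xs) / N)  # sigma^2_ML

avg_ml_var = sum(ml_vars) / trials
print(round(avg_ml_var, 2))   # near (N-1)/N * sigma^2 = 0.5, not 1.0
```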

Random Sample

If $X_1, \dots, X_n$ are independent and each has the same marginal distribution with cdf $F$, we say that $X_1, \dots, X_n$ are iid (independent and identically distributed) and we write $X_1, \dots, X_n \sim F$. If $F$ has density $f$, we also write $X_1, \dots, X_n \sim f$. We also call $X_1, \dots, X_n$ a random sample of size $n$ from $F$.

Expectation of a Random Variable

The expected value, or mean, of $X$ is defined to be

$$\mathbb E(X) = \int x\, dF(x) = \begin{cases} \sum_x x f(x) & \text{if $X$ is discrete} \\ \int x f(x)\, dx & \text{if $X$ is continuous} \end{cases}$$

assuming that the sum (or integral) is well defined. We use the following notation to denote the expected value of $X$:

$$\mathbb E(X) = \int x\, dF(x) = \mu = \mu_X.$$

The expectation exists provided that $\int |x|\, dF_X(x) < \infty$; otherwise, we say that the expectation does not exist. The mean, or expectation, of a random variable $X$ is the average value of $X$. If $Y = r(X)$ then

$$\mathbb E(Y) = \int r(x)\, dF(x).$$

Properties of Expectations

If $X_1, \dots, X_n$ are random variables and $a_1, \dots, a_n$ are constants, then

$$\mathbb E\Big(\sum_i a_i X_i\Big) = \sum_i a_i\, \mathbb E(X_i).$$

If $X_1, \dots, X_n$ are independent random variables, then

$$\mathbb E\Big(\prod_{i=1}^n X_i\Big) = \prod_i \mathbb E(X_i)$$

Variance and Covariance

The variance measures the “spread” of a distribution. Let $X$ be a random variable with mean $\mu$. The variance of $X$, denoted $\sigma^2$, $\sigma^2_X$, or $\mathbb V(X)$, is defined by

$$\sigma^2 = \mathbb E(X - \mu)^2 = \int (x - \mu)^2\, dF(x)$$

assuming this expectation exists. The standard deviation is $sd(X) = \sqrt{\mathbb V(X)}$ and is also denoted by $\sigma$ and $\sigma_X$. Assuming the variance is well defined, it satisfies $\mathbb V(X) = \mathbb E(X^2) - \mu^2$ and $\mathbb V(aX + b) = a^2\, \mathbb V(X)$ for constants $a$ and $b$. If $X_1, \dots, X_n$ are independent and $a_1, \dots, a_n$ are constants, then

$$\mathbb V\Big(\sum_{i=1}^n a_i X_i\Big) = \sum_{i=1}^n a^2_i\, \mathbb V(X_i)$$

If $X_1, \dots, X_n$ are random variables, then we define the sample mean to be

$$\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i$$

and the sample variance to be

$$S^2_n = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2.$$

Let $X_1, \dots, X_n$ be iid and let $\mu = \mathbb E(X_i)$, $\sigma^2 = \mathbb V(X_i)$. Then

$$\mathbb E(\bar X_n) = \mu, \quad \mathbb V(\bar X_n) = \frac{\sigma^2}{n}, \quad \mathbb E(S^2_n) = \sigma^2$$

If $X$ and $Y$ are random variables, then the covariance and correlation between $X$ and $Y$ measure how strong the linear relationship is between $X$ and $Y$. Let $X$ and $Y$ be random variables with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$. Define the covariance between $X$ and $Y$ by

$$\text{Cov}(X, Y) = \mathbb E\big((X - \mu_X)(Y - \mu_Y)\big),$$

and the correlation by

$$\rho = \rho_{X,Y} = \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$

The covariance satisfies:

$$\text{Cov}(X, Y) = \mathbb E(XY) - \mathbb E(X)\,\mathbb E(Y).$$

The correlation satisfies $-1 \le \rho(X, Y) \le 1$. If $Y = aX + b$ for some constants $a$ and $b$, then $\rho(X, Y) = 1$ if $a > 0$ and $\rho(X, Y) = -1$ if $a < 0$. If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = \rho = 0$; the converse is not true in general. In general, $\mathbb V(X + Y) = \mathbb V(X) + \mathbb V(Y) + 2\,\text{Cov}(X, Y)$ and $\mathbb V(X - Y) = \mathbb V(X) + \mathbb V(Y) - 2\,\text{Cov}(X, Y)$. More generally, for random variables $X_1, \dots, X_n$,

$$\mathbb V\Big(\sum_i a_i X_i\Big) = \sum_i a^2_i\, \mathbb V(X_i) + 2\sum_{i<j} a_i a_j\, \text{Cov}(X_i, X_j)$$
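A small numerical sketch of covariance and correlation, using hypothetical paired data and checking the identity $\text{Cov}(X, Y) = \mathbb E(XY) - \mathbb E(X)\,\mathbb E(Y)$:

```python
import math

# Hypothetical paired data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
rho = cov / (sx * sy)

# Check the identity Cov(X, Y) = E(XY) - E(X)E(Y)
e_xy = sum(x * y for x, y in zip(xs, ys)) / n
print(round(rho, 4))   # close to 1 for a near-linear relationship
```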

Consider a random vector $X$ of the form

$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}$$

Then the mean of $X$ is

$$\mu = \begin{pmatrix} \mathbb E(X_1) \\ \vdots \\ \mathbb E(X_n) \end{pmatrix}$$

The variance-covariance matrix $\Sigma$ is defined to be

$$\mathbb V(X) = \begin{pmatrix} \mathbb V(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \mathbb V(X_2) & \cdots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \cdots & \mathbb V(X_n) \end{pmatrix}$$

If $a$ is a vector and $X$ is a random vector with mean $\mu$ and variance $\Sigma$, then $\mathbb E(a^T X) = a^T \mu$ and $\mathbb V(a^T X) = a^T \Sigma a$. If $A$ is a matrix, then $\mathbb E(AX) = A\mu$ and $\mathbb V(AX) = A \Sigma A^T$.

Conditional Expectation

Suppose that $X$ and $Y$ are random variables. What is the mean of $X$ among those times when $Y = y$? The answer is that we compute the mean of $X$ as before, but we substitute $f_{X \mid Y}(x \mid y)$ for $f_X(x)$ in the definition of expectation. The conditional expectation of $X$ given $Y = y$ is

$$\mathbb E(X \mid Y = y) = \begin{cases} \sum_x x\, f_{X \mid Y}(x \mid y) & \text{discrete case} \\ \int x\, f_{X \mid Y}(x \mid y)\, dx & \text{continuous case.} \end{cases}$$

If $r(x, y)$ is a function of $x$ and $y$ then

$$\mathbb E(r(X, Y) \mid Y = y) = \begin{cases} \sum_x r(x, y)\, f_{X \mid Y}(x \mid y) & \text{discrete case} \\ \int r(x, y)\, f_{X \mid Y}(x \mid y)\, dx & \text{continuous case.} \end{cases}$$

Note that whereas $\mathbb E(X)$ is a number, $\mathbb E(X \mid Y = y)$ is a function of $y$. Before we observe $Y$, we don’t know the value of $\mathbb E(X \mid Y = y)$, so it is a random variable, which we denote $\mathbb E(X \mid Y)$.

(The Rule of Iterated Expectations.) For random variables $X$ and $Y$, assuming the expectations exist, we have that

$$\mathbb E\big(\mathbb E(Y \mid X)\big) = \mathbb E(Y), \quad \mathbb E\big(\mathbb E(X \mid Y)\big) = \mathbb E(X).$$

More generally, for any function $r(x, y)$ we have

$$\mathbb E\big(\mathbb E(r(X, Y) \mid X)\big) = \mathbb E(r(X, Y)).$$

The conditional variance is defined as

$$\mathbb V(Y \mid X = x) = \int \big(y - \mu(x)\big)^2 f(y \mid x)\, dy$$

where $\mu(x) = \mathbb E(Y \mid X = x)$. For random variables $X$ and $Y$,

$$\mathbb V(Y) = \mathbb E\,\mathbb V(Y \mid X) + \mathbb V\,\mathbb E(Y \mid X).$$

Inequalities For Expectations

Some Statistics

In this part, we discuss the basics of statistical inference.

Descriptive Statistics


| Concept | Description |
| --- | --- |
| Mean | Average: $\bar{x} = \frac{1}{n} \sum x_i$ |
| Median | Middle value (robust to outliers) |
| Mode | Most frequent value |
| Variance | Average squared deviation from the mean |
| Standard Deviation | Square root of variance |
| IQR | Q3 − Q1 (range of the middle 50%) |
| Skewness | Measures asymmetry |
| Kurtosis | Measures "tailedness" or peak heaviness |
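Most of these quantities are available in the Python standard library's `statistics` module; a minimal sketch on a small illustrative sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # small illustrative sample

mean = statistics.mean(data)        # 5.0
median = statistics.median(data)    # 4.5
mode = statistics.mode(data)        # 4
pvar = statistics.pvariance(data)   # population variance: 4.0
pstd = statistics.pstdev(data)      # 2.0

# Quartiles (conventions vary between libraries) and the IQR
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, mode, pstd, iqr)
```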

When to use what:

Data Visualization (EDA)

Core Distributions

| Distribution | Use Case |
| --- | --- |
| Normal ($\mathcal{N}(\mu, \sigma^2)$) | Natural data, CLT, errors |
| Bernoulli/Binomial | Yes/No events, coin flips |
| Poisson | Count of events in fixed time ($\lambda$ rate) |
| Exponential | Time until next event (e.g., arrival times) |
| Uniform | Equal likelihood over an interval |
| Multivariate Normal | Joint distribution over multiple features |

Important Properties:

| Concept | Why It Matters in ML |
| --- | --- |
| Mean, Std | Feature normalization, loss functions |
| Skewness, Outliers | Scaling, robust modeling |
| Normal Distribution | Linear models assume normality of errors |
| Poisson, Binomial | Modeling counts and probabilities |
| Boxplots/Histograms | Feature exploration & preprocessing |

Statistical Inference

The main tools of inference are confidence intervals and tests of hypotheses. In a typical statistical problem, we have a random variable $X$ of interest, but its pdf $f(x)$ or cdf $F(x)$ is not known. In fact, either $f(x)$ is completely unknown, or its form is known except for an unknown parameter (or vector of parameters) $\theta$.

Our information about the unknown distribution of $X$, or the unknown parameters of the distribution of $X$, comes from a sample on $X$. A function $T = T(X_1, \dots, X_n)$ of the sample is called a statistic.

A typical statistical inference question is:

Given a sample $X_1, \dots, X_n \sim F$, how do we infer $F$?


There are many approaches to statistical inference. The two dominant approaches are called frequentist inference and Bayesian inference. A statistical model $\mathfrak F$ is a set of distributions (or densities or regression functions). A parametric model is a statistical model $\mathfrak F$ that can be parameterized by a finite number of parameters. In general, a parametric model takes the form

$$\mathfrak F = \{ f(x; \theta) : \theta \in \Theta \}$$

where $\theta$ is an unknown parameter (or vector of parameters) that can take values in the parameter space $\Theta$. A nonparametric model is a set $\mathfrak F$ that cannot be parameterized by a finite number of parameters. For example, $\mathfrak F_{\text{ALL}} = \{\text{all cdfs}\}$ is nonparametric.

For example, suppose we observe pairs of data $(X_1, Y_1), \dots, (X_n, Y_n)$. Perhaps $X_i$ is the blood pressure of subject $i$ and $Y_i$ is how long they live. $X$ is called a predictor or regressor or feature or independent variable. $Y$ is called the outcome or the response variable or the dependent variable. We call $r(x) = \mathbb E(Y \mid X = x)$ the regression function. If we assume that $r \in \mathfrak F$ where $\mathfrak F$ is finite dimensional (the set of straight lines, for example), then we have a parametric regression model. If we assume that $r \in \mathfrak F$ where $\mathfrak F$ is not finite dimensional, then we have a nonparametric regression model. The goal of predicting $Y$ for a new patient based on their $X$ value is called prediction. If $Y$ is discrete (for example, live or die), then prediction is instead called classification. If our goal is to estimate the function $r$, then we call this regression or curve estimation. Regression models are sometimes written as

$$Y = r(X) + \epsilon$$

where $\mathbb E(\epsilon) = 0$. Many inferential problems can be identified as being one of three types: estimation, confidence sets, or hypothesis testing.

Point estimation refers to providing a single “best guess” of some quantity of interest (like a mean, proportion, or variance) from sample data. The quantity of interest could be a parameter in a parametric model, a cdf $F$, a probability density function $f$, a regression function $r$, or a prediction for a future value $Y$ of some random variable. By convention, we denote a point estimate of $\theta$ by $\hat\theta$ or $\hat\theta_n$. Remember that $\theta$ is a fixed, unknown quantity. The estimate $\hat\theta$ depends on the data, so $\hat\theta$ is a random variable.

More formally, let $X_1, \dots, X_n$ be $n$ iid data points from some distribution $F$. A point estimator $\hat\theta_n$ of a parameter $\theta$ is some function of $X_1, \dots, X_n$:

$$\hat\theta_n = g(X_1, \dots, X_n).$$

The bias of an estimator is defined by $\text{bias}(\hat\theta_n) = \mathbb E(\hat\theta_n) - \theta$. We say that $\hat\theta_n$ is unbiased if $\mathbb E(\hat\theta_n) = \theta$. Many of the estimators we will use are biased. A reasonable requirement for an estimator is that it should converge to the true parameter value as we collect more and more data. This requirement is quantified by the following definition:

A point estimator $\hat\theta_n$ of a parameter $\theta$ is consistent if $\hat\theta_n \xrightarrow{\;p\;} \theta$, which means $\hat\theta_n$ converges to $\theta$ in probability. Equivalently, for every $\epsilon > 0$,

$$p(|\hat\theta_n - \theta| > \epsilon) \rightarrow 0$$

as $n \rightarrow \infty$.

The distribution of $\hat\theta_n$ is called the sampling distribution. The statistic $\hat\theta_n$ varies from sample to sample, and this variability is captured by its sampling distribution. The standard deviation of $\hat\theta_n$ is called the standard error: $se = \sqrt{\mathbb V(\hat\theta_n)}$. Often, the standard error depends on the unknown $F$; in those cases, $se$ is an unknown quantity, but we usually can estimate it. The estimated standard error is denoted by $\widehat{se}$.

For example, if $X_1, \dots, X_n \sim \text{Bernoulli}(p)$, let $\hat p_n = n^{-1}\sum_i X_i$. Then $\mathbb E(\hat p_n) = p$, so $\hat p_n$ is unbiased. The standard error is $se = \sqrt{\mathbb V(\hat p_n)} = \sqrt{p(1-p)/n}$. The estimated standard error is $\widehat{se} = \sqrt{\hat p_n(1 - \hat p_n)/n}$.
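A quick sketch of this example: with hypothetical values $p = 0.3$ and $n = 500$, the estimated standard error tracks the true one closely:

```python
import math
import random

random.seed(2)
p, n = 0.3, 500   # hypothetical true p and sample size

# One sample of n Bernoulli(p) observations
xs = [1 if random.random() < p else 0 for _ in range(n)]
p_hat = sum(xs) / n

se_true = math.sqrt(p * (1 - p) / n)          # needs the unknown p
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)   # computable from data

print(round(p_hat, 3), round(se_true, 4), round(se_hat, 4))
```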

The quality of a point estimate is sometimes assessed by the mean squared error:

$$MSE(\hat\theta) = \mathbb E(\hat\theta_n - \theta)^2$$

More specifically:

$$\begin{aligned} \mathbb E(\hat\theta_n - \theta)^2 &= \int \cdots \int \big(\hat\theta_n(x_1, \dots, x_n) - \theta\big)^2 f(x_1, \dots, x_n; \theta)\; dx_1 \cdots dx_n \\ &= \int \cdots \int \big(\hat\theta_n(x_1, \dots, x_n) - \theta\big)^2 \prod_{i=1}^n f(x_i; \theta)\; dx_1 \cdots dx_n \end{aligned}$$

It is easy to see that

$$\begin{aligned} \mathbb E(\hat\theta_n - \theta)^2 &= \big(\mathbb E(\hat\theta_n) - \theta\big)^2 + \mathbb E\big(\hat\theta_n - \mathbb E(\hat\theta_n)\big)^2 \\ &= \text{bias}^2(\hat\theta_n) + \mathbb V(\hat\theta_n) \end{aligned}$$

That is,

$$\color{green}\boxed{MSE(\hat\theta) = \text{bias}^2(\hat\theta_n) + \mathbb V(\hat\theta_n)}$$
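The boxed decomposition can be checked numerically. Taking $\hat\theta_n = \sigma^2_{ML}$ from $N$ Gaussian draws (a biased estimator), the identity holds exactly for the empirical moments:

```python
import random

random.seed(3)
theta = 1.0           # true variance of the generating N(0, 1)
N, trials = 5, 4000

# theta_hat is the (biased) maximum likelihood variance from N points
ests = []
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    m = sum(xs) / N
    ests.append(sum((x - m) ** 2 for x in xs) / N)

mean_est = sum(ests) / trials
mse = sum((t - theta) ** 2 for t in ests) / trials
bias = mean_est - theta
var = sum((t - mean_est) ** 2 for t in ests) / trials

# bias^2 + variance reproduces the MSE exactly for empirical moments
print(round(mse, 4), round(bias ** 2 + var, 4))
```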

Many of the estimators we will encounter turn out to have, approximately, a Normal distribution. An estimator is asymptotically Normal if

$$\frac{\hat\theta_n - \theta}{se} \rightsquigarrow \mathcal N(0, 1)$$

where $\rightsquigarrow$ denotes convergence in distribution. We say $X_n \rightsquigarrow X$ if $\lim_{n \rightarrow \infty} F_n(t) = F(t)$ at all $t$ for which $F$ is continuous.

Confidence Intervals

A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $(a_n, b_n)$, where $a_n = a_n(X_1, \dots, X_n)$ and $b_n = b_n(X_1, \dots, X_n)$ are functions of the data such that

$$p(a_n < \theta < b_n) \ge 1 - \alpha.$$

In words, $(a_n, b_n)$ traps $\theta$ with probability $1 - \alpha$. We call $1 - \alpha$ the coverage of the confidence interval. Note that $(a_n, b_n)$ is random and $\theta$ is fixed. Commonly, people use 95% confidence intervals, which corresponds to choosing $\alpha = 0.05$. The interpretation of a confidence interval can be stated as follows: “If we repeated the study 100 times, about 95 of the intervals would contain $\theta$.” If $\theta$ is a vector, then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

In Bayesian methods we treat $\theta$ as if it were a random variable, and we do make probability statements about $\theta$. In particular, we will make statements like “the probability that $\theta$ is in $(a_n, b_n)$, given the data, is 95 percent.” However, these Bayesian intervals refer to degree-of-belief probabilities and will not, in general, trap the parameter 95 percent of the time. As mentioned earlier, point estimators often have a limiting Normal distribution, that is, $\hat\theta_n \approx \mathcal N(\theta, \widehat{se}^2)$. In this case, we can construct (approximate) confidence intervals as follows.

Normal-based Confidence Interval

Suppose that $\hat\theta_n \approx \mathcal N(\theta, \widehat{se}^2)$. Let $\Phi$ be the cdf of a standard Normal and let

$$z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2),$$

that is, $p(Z > z_{\alpha/2}) = \alpha/2$ and $p(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$, where $Z \sim \mathcal N(0, 1)$. Then

$$p\big(\hat\theta_n - z_{\alpha/2}\,\widehat{se} < \theta < \hat\theta_n + z_{\alpha/2}\,\widehat{se}\big) \rightarrow 1 - \alpha.$$

This is because if we assume $(\hat\theta_n - \theta)/\widehat{se} \rightsquigarrow Z \sim \mathcal N(0, 1)$, then

$$\begin{aligned} p\big(\hat\theta_n - z_{\alpha/2}\,\widehat{se} < \theta < \hat\theta_n + z_{\alpha/2}\,\widehat{se}\big) &= p\Big(-z_{\alpha/2} < \frac{\hat\theta_n - \theta}{\widehat{se}} < z_{\alpha/2}\Big) \\ &\rightarrow p(-z_{\alpha/2} < Z < z_{\alpha/2}) \\ &= 1 - \alpha \end{aligned}$$

For 95% confidence intervals, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96 \approx 2$, leading to the approximate 95% confidence interval $\hat\theta_n \pm 2\,\widehat{se}$.
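The coverage interpretation can be checked by simulation; a sketch with hypothetical Bernoulli data, where roughly 95% of the Normal-based intervals should trap the true $p$:

```python
import math
import random

random.seed(4)
p, n, z = 0.5, 200, 1.96   # true p, sample size, z_{0.025}
reps = 1000
covered = 0

for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    p_hat = sum(xs) / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    # Does the interval p_hat +/- z * se_hat trap the true p?
    if p_hat - z * se_hat < p < p_hat + z * se_hat:
        covered += 1

coverage = covered / reps
print(coverage)   # close to the nominal 0.95
```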

Estimating the cdf and Statistical Functionals

Let $X_1, \dots, X_n$ be a random sample on a random variable $X$ with cdf $F(x)$. A histogram of the sample is an estimate of the pmf or pdf of $X$, depending on whether $X$ is discrete or continuous. Here we make no assumptions about the form of the distribution of $X$. In particular, we do not assume a parametric form of the distribution as we did for maximum likelihood estimates; hence, the histogram is often called a nonparametric estimator. Similarly, we can consider nonparametric estimation of the cdf $F$ as well as functions of the cdf such as the mean, the variance, and the correlation.

The Empirical Distribution Function

Let $X_1, \dots, X_n \sim F$ be an iid sample where $F$ is a distribution function on the real line. We will estimate $F$ with the empirical distribution function, which is defined as follows:

$$\hat F_n(x) = \frac{\#\{X_i \le x\}}{n}.$$
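A minimal sketch of the empirical distribution function:

```python
# Empirical distribution function: F_n(x) = #{X_i <= x} / n
def ecdf(sample):
    xs = sorted(sample)
    n = len(xs)
    def F_hat(x):
        # A linear count keeps the definition visible; bisect is faster
        return sum(1 for v in xs if v <= x) / n
    return F_hat

F_hat = ecdf([3.1, 1.2, 4.8, 2.5, 2.5])
print(F_hat(0.0), F_hat(2.5), F_hat(5.0))   # 0.0 0.6 1.0
```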

The following results are from a mathematical theorem:

Plug-in Estimators

Many statistics are functions of FF such as

A plug-in estimator of a statistic $\theta = T(F)$ is defined by $\hat \theta_n = T(\hat F_n)$. In other words, just plug in $\hat F_n$ for the unknown $F$. Assume that somehow we can find an estimate $\widehat {se}$. In many cases, it turns out that $T(\hat F_n) \approx \mathcal N(T(F), \widehat {se}^2)$. An approximate $1-\alpha$ confidence interval for $T(F)$ is then $T(\hat F_n) \pm z_{\alpha/2} \widehat {se}$. We will call this the Normal-based interval. For a 95% confidence interval, $z_{\alpha/2} = z_{.05/2} = 1.96 \approx 2$ so the interval is $T(\hat F_n) \pm 2\widehat {se}$.

Example (The Mean): Let $\mu = T(F) = \int x\, dF(x)$. The plug-in estimator is $\hat\mu = \int x\, d\hat F_n(x) = \frac{1}{n}\sum_i X_i = \bar X_n$.

Example (The Variance): Let $σ^2 = T(F) = \mathbb V(X) = \int x^2\, dF(x) - \big(\int x\, dF(x)\big)^2$. The plug-in estimator is:

$$\begin{align*} \hat \sigma^2 & = \int x^2 d\hat F_n(x) - \Big(\int xd\hat F_n(x) \Big)^2 \\ & = \frac{1}{n} \sum_i X_i^2 - \Big(\frac{1}{n} \sum_i X_i \Big)^2 \\ & = \frac{1}{n} \sum_i \Big( X_i - \bar X_n\Big)^2. \end{align*}$$

Another reasonable estimator of $\sigma^2$ is the sample variance

$$S^2_n = \frac{1}{n-1} \sum_i \Big( X_i - \bar X_n\Big)^2$$

In practice, there is little difference between $\hat\sigma^2$ and $S^2_n$ and you can use either one. Returning to the last example, we now see that the estimated standard error of the estimate of the mean is $\widehat {se} = \hat \sigma/ \sqrt n$.
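In NumPy the two estimators correspond to the `ddof` argument of `np.var`; a small check on arbitrary data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

plugin_var = np.var(x, ddof=0)   # divides by n   (plug-in estimator)
sample_var = np.var(x, ddof=1)   # divides by n-1 (sample variance)
se_mean = np.sqrt(plugin_var) / np.sqrt(len(x))  # estimated SE of the mean

print(plugin_var, sample_var, se_mean)
```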

Example (Correlation). Let $Z = (X, Y)$ and let

$$\rho = T(F) = \mathbb E\big[(X - \mu_X)(Y - \mu_Y)\big]/(\sigma_X \sigma_Y)$$

denote the correlation between XX and YY, where F(x,y)F (x, y) is bivariate. We can write

ρ=a(T1(F),T2(F),T3(F),T4(F),T5(F))\rho = a(T_1(F ), T_2(F ), T_3(F ), T_4(F ), T_5(F ))

where
T1(F)=xdF(z),      T2(F)=ydF(z),      T3(F)=xydF(z),T4(F)=x2dF(z),      T5(F)=y2dF(z),\begin{align*} &T_1(F) = \int x dF(z), \;\;\; T_2(F) = \int y dF(z), \;\;\; T_3(F) = \int xy dF(z), \\ &T_4(F) = \int x^2 dF(z), \;\;\; T_5(F) = \int y^2 dF(z), \end{align*}

and

a(t1,,t5)=t3t1t2(t4t12)(t5t22)a(t_1,\dots,t_5) = \frac{t_3 - t_1t_2}{\sqrt{(t_4 - t_1^2)(t_5 - t_2^2)}}

Replace FF with F^n\hat F_n in T1(F),,T5(F)T_1(F), \dots, T_5(F) and take

$$\hat\rho = a(T_1(\hat F_n ), T_2(\hat F_n ), T_3(\hat F_n ), T_4(\hat F_n ), T_5(\hat F_n ))$$

We get
ρ^=i(XiXˉn)(YiYˉn)i(XiXˉn)2i(YiYˉn)2\hat ρ= \frac{\sum_i(X_i− \bar X_n)(Y_i− \bar Y_n)}{\sqrt{\sum_i(X_i - \bar X_n)^2}\sqrt{\sum_i (Y_i - \bar Y_n)^2}}

which is called the sample correlation.
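The sample correlation is exactly what `np.corrcoef` computes; a quick check on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Plug-in (sample) correlation, matching the formula above
xc, yc = x - x.mean(), y - y.mean()
rho_hat = np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

print(rho_hat, np.corrcoef(x, y)[0, 1])  # the two agree
```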

Example (Quantiles). Let $F$ be strictly increasing with density $f$. For $0 < p < 1$, the $p$th quantile is defined by $T(F) = F^{-1}(p)$. The estimate of $T(F)$ is $\hat F^{-1}_n(p)$. We have to be a bit careful since $\hat F_n$ is not invertible. To avoid ambiguity we define

$$\hat F^{-1}_n (p) = \inf \{x : \hat F_n(x) \ge p\}$$

We call T(F^n)=F^n1(p)T(\hat F_n) = \hat F^{-1}_n (p) the ppth sample quantile.

Bootstrap

Bootstrap is a resampling technique used to estimate the distribution of a statistic (e.g., mean, median, variance, model accuracy) when the true sampling distribution is unknown or hard to derive analytically. It allows us to:

without strong parametric assumptions.

Let Tn=g(X1,...,Xn)T_n = g(X_1, . . . , X_n) be a statistic, that is, TnT_n is any function of the data. Suppose we want to know VF(Tn)\mathbb V_F (T_n), the variance of TnT_n. We have written VF\mathbb V_F to emphasize that the variance usually depends on the unknown distribution function FF. For example, if Tn=XˉnT_n = \bar X_n then VF(Tn)=σ2/n\mathbb V_F (T_n) = σ^2/n where σ2=(xµ)2dF(x)σ^2 = \int (x− µ)^2dF (x) and μ=xdF(x)\mu= \int xdF (x). Thus the variance of TnT_n is a function of FF. The bootstrap idea has two steps:

For $T_n = \bar X_n$, we have for Step 1 that $\mathbb V_{\hat F_n} (T_n)= \hat σ^2/n$ where

$$\hat σ^2 = \frac{1}{n} \sum_i \Big( X_i - \bar X_n\Big)^2$$

In this case, Step 1 is enough. However, in more complicated cases we cannot write down a simple formula for $\mathbb V_{\hat F_n} (T_n)$, which is why Step 2 is needed. This is the bootstrap step, which simply says to

This constitutes one draw from the distribution of $T_n$. We repeat these two steps $m$ times to get $T_{n,1}^*, \dots, T_{n,m}^*$. Now you have an empirical distribution of these $T_{n,i}^*$'s to estimate the variance, standard error, confidence intervals, etc. For example, here is how to use the bootstrap to find a 95% confidence interval for the median:

import numpy as np
from sklearn.utils import resample

data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])

# Resample with replacement 10,000 times, recording the median each time
boot_medians = [np.median(resample(data)) for _ in range(10000)]

boot_se = np.std(boot_medians)  # bootstrap estimate of the standard error
ci_lower, ci_upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"Bootstrap SE of the median: {boot_se:.2f}")
print(f"95% CI for the median: ({ci_lower:.2f}, {ci_upper:.2f})")

In the context of data science or ML engineering, we can describe the bootstrap as follows: suppose you have a dataset $D = \{ x_1, x_2, \dots, x_n \}$.

  1. Resample with replacement:
    Generate mm new datasets D1,,DmD_1, \dots, D_m, each of size nn, drawn with replacement from DD
  2. Compute the statistic θ^i\hat\theta^*_i on each DiD_i
  3. Use the empirical distribution of these θ^i\hat \theta^*_i values to
    • Estimate the standard error
    • Build confidence intervals
    • Estimate bias or other metrics
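The three steps can be sketched as a generic helper; `statistic` is any function of a sample (the data array is the same toy sample used earlier):

```python
import numpy as np

def bootstrap(data, statistic, m=10_000, seed=0):
    """Return m bootstrap replicates of `statistic` computed on `data`."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.empty(m)
    for i in range(m):
        resampled = rng.choice(data, size=n, replace=True)  # step 1
        reps[i] = statistic(resampled)                      # step 2
    return reps

data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])
reps = bootstrap(data, np.mean)
print("SE estimate:", reps.std(ddof=1))                     # step 3
print("95% CI:", np.percentile(reps, [2.5, 97.5]))
```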

Bootstrap works well when:

Parametric Inference

We now turn our attention to parametric models, that is, models of the form
F={f(x;θ):θΘ}\mathfrak F= \{ f (x; θ) : θ ∈ Θ \}
where $Θ ⊂ \mathbb R^k$ is the parameter space and $θ= (θ_1, . . . , θ_k )$ is the parameter. The problem of inference then reduces to the problem of estimating the parameter $θ$. You might ask: how would we ever know that the distribution that generated the data is in some parametric model? This is an excellent question. Indeed, we would rarely have such knowledge, which is why nonparametric methods are preferable. Still, studying methods for parametric models is useful for two reasons. First, there are some cases where background knowledge suggests that a parametric model provides a reasonable approximation. Second, the inferential concepts for parametric models provide background for understanding certain nonparametric methods.

Maximum Likelihood

The most common method for estimating parameters in a parametric model is the maximum likelihood method. Let X1,...,XnX_1,. . ., X_n be iid with pdf f(x;θ)f (x; θ). The likelihood function is defined by
$$\mathcal L_n(θ) = \prod_{i=1}^n f(X_i; θ)$$

The log-likelihood function is defined by $ℓ_n(θ) = \log \mathcal L_n(θ)$. The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $θ$. Thus, $\mathcal L_n : Θ → [0, ∞)$. The likelihood function is not a density function: in general, it is not true that $\mathcal L_n(θ)$ integrates to 1 (with respect to $θ$).

The maximum likelihood estimator (MLE), denoted by θ^n\hat θ_n, is the value of θθ that maximizes Ln(θ)\mathcal L_n(θ). The maximum of n(θ)ℓ_n(θ) occurs at the same place as the maximum of Ln(θ)\mathcal L_n(θ), so maximizing the log-likelihood leads to the same answer as maximizing the likelihood. Often, it is easier to work with the log-likelihood.

In some cases we can find the MLE analytically; frequently $\hat θ_n$ solves the equation $\frac{\partial \ell_n (\theta)}{\partial \theta} = 0$. If $θ$ is a vector of parameters, this results in a system of equations to be solved simultaneously. More often, we need to find the MLE by numerical methods. We will briefly discuss two commonly used methods:

Both are iterative methods that produce a sequence of values $θ_0, θ_1, \dots$ that, under ideal conditions, converge to the MLE $\hat θ_n$.
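As an illustration of the numerical route, consider an iid Exponential($\lambda$) sample, for which $\ell_n(\lambda) = n\log\lambda - \lambda\sum_i X_i$, $\ell_n'(\lambda) = n/\lambda - \sum_i X_i$, and $\ell_n''(\lambda) = -n/\lambda^2$. A sketch of Newton's method on simulated data (the starting value matters, since Newton can diverge from a poor start):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=1000)  # true lambda = 0.5
n, s = len(x), x.sum()

lam = 0.1  # starting value; must not be too far from the answer
for _ in range(50):
    score = n / lam - s        # first derivative of the log-likelihood
    hess = -n / lam**2         # second derivative
    lam -= score / hess        # Newton update

print(lam, 1 / x.mean())       # numerical MLE vs the analytic MLE 1/x-bar
```

The iterates converge quadratically to the closed-form answer $\hat\lambda = 1/\bar X_n$.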

Properties of Maximum Likelihood Estimators

Under certain conditions on the model, the maximum likelihood estimator $\hat θ_n$ possesses many properties that make it an appealing choice of estimator. The main properties of the MLE are:

Hypothesis Testing and p-values

The primary focus of inference is to learn about characteristics of a population given samples from that population. Probability theory is used as a basis for accepting or rejecting hypotheses about the parameters of the population. Suppose that we partition the parameter space $Θ$ into two disjoint sets $Θ_0$ and $Θ_1$ and that we wish to test
H0:θΘ0    versus    H1:θΘ1.H_0 : θ ∈ Θ_0 \;\; \text{versus}\;\; H_1 : θ ∈ Θ_1.

We call H0H_0 the null hypothesis and H1H_1 the alternative hypothesis. Given a random variable XX whose range is X\mathcal X, we test a hypothesis about a test statistic TT related to variable XX by finding an appropriate subset of outcomes RXR ⊂ \mathcal X called the rejection region. If XRX ∈ R we reject the null hypothesis, otherwise, we do not reject the null hypothesis.

|  | Retain Null | Reject Null |
| --- | --- | --- |
| $H_0$ true | correct decision | Type I Error |
| $H_1$ true | Type II Error | correct decision |

Usually, the rejection region RR is of the form
R={x:T(x)>c}R = \{x: T(x) > c \}

where TT is a test statistic and cc is a critical value. The problem in hypothesis testing is to find an appropriate test statistic TT and an appropriate critical value cc.

The null hypothesis always states some expectation regarding a population parameter, such as the population mean, median, standard deviation, or variance. It is never stated in terms of expectations of a sample. In fact, sample statistics are rarely identical even across samples selected from the same population. For example, ten tosses of a single coin rarely result in 5 heads and 5 tails. The discipline of statistics sets rules for making an inductive leap from sample statistics to population parameters. The alternative hypothesis denies the null hypothesis. Note that the null and alternative hypotheses are mutually exclusive and exhaustive; no other possibility exists. In fact, they state the opposite of each other. The null hypothesis can never be proven true by sampling. If you flipped a coin 1,000,000 times and obtained exactly 500,000 heads, wouldn't that be proof of the coin's fairness? No! It would merely indicate that, if a bias does exist, it must be exceptionally small.

Although we cannot prove the null hypothesis, we can set up conditions that permit us to reject it. For example, if we get 950,000 heads, would anyone seriously doubt that the coin is biased? We would reject the null hypothesis that the coin is fair. The frame of reference for statistical decision making is provided by the sampling distribution of a statistic. A sampling distribution is a theoretical probability distribution of the possible values of some sample statistic that would occur if we were to draw all possible samples of a fixed size from a given population. There is a sampling distribution for every statistic.

The level of significance $\alpha$ set by the investigator for rejecting the null hypothesis is known as the alpha level. For example, if $\alpha=0.05$ and the test statistic is 1.43, where the null hypothesis assumes the chance model is a normal distribution, then we fail to reject $H_0$ because the test statistic does not reach the critical value (1.96). But if $\alpha=0.01$ and the test statistic is 2.83, then we reject because the test statistic falls in the rejection region (it exceeds 2.58). Thus if $\alpha=0.05$, about 5 times out of 100 we will falsely reject a true null hypothesis (a Type I error).

Power of a test:

Probability of Type I error is α\alpha . Probability of Type II error is β\beta. The power of a test is the probability of correctly rejecting H0H_0, which is 1β1-\beta. So, high power means a low chance of missing a real effect. In order to achieve the desired power, we need to choose the right sample size for our testing.

Factors That Influence Power

| Factor | Effect on Power |
| --- | --- |
| Sample size ($n$) ↑ | Power increases with larger $n$ |
| Effect size ($\Delta$) ↑ | A bigger difference is easier to detect, so power increases |
| Significance level ($\alpha$) ↑ | Loosening $\alpha$ (e.g., from 0.01 to 0.05) increases power |
| Standard deviation ($\sigma$) ↓ | Less variability increases power |
| Test type (1-sided vs. 2-sided) | A 1-sided test has more power (but only if the direction is correct) |

Power increases with sample size, meaning you're more likely to detect real effects. Researchers often aim for: Power ≥ 0.80, meaning, 80% chance of detecting a true effect if it exists.
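A back-of-the-envelope power calculation for a two-sided $z$-test (ignoring the negligible far tail); the shift, $\sigma$, and sample sizes below are hypothetical:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_ztest(delta, sigma, n, z_crit=1.96):
    """Approximate power of a two-sided z-test (far tail ignored)."""
    return norm_cdf(delta / (sigma / sqrt(n)) - z_crit)

# Hypothetical: detect a shift of 0.5 with sigma = 2 at alpha = 0.05
for n in (50, 100, 200):
    print(n, round(power_ztest(0.5, 2.0, n), 3))
```

Power rises from roughly 0.42 at $n=50$ to roughly 0.94 at $n=200$, illustrating the first row of the table.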

p-value

p-value (1st definition): the smallest Type I error rate you would have to be willing to tolerate in order to reject the null hypothesis. If the p-value describes an error rate you find intolerable, you must retain the null. In other words, for any significance level $\alpha \ge$ p-value you reject the null; otherwise you retain it.

For each αα we can ask: does our test reject H0H_0 at level αα? The p-value is the smallest αα at which we do reject H0H_0. If the evidence against H0H_0 is strong, the p-value will be small.

| p-value | Evidence |
| --- | --- |
| < .01 | very strong evidence against $H_0$ |
| .01 – .05 | strong evidence against $H_0$ |
| .05 – .10 | weak evidence against $H_0$ |
| > .10 | little or no evidence against $H_0$ |

Note that a large p-value is not strong evidence in favor of H0H_0. A large p-value can occur for two reasons:

Also do not confuse the p-value with p(H0Data)p(H_0|\text{Data}). The p-value is not the probability that the null hypothesis is true. This is wrong in two ways:

Equivalently, the p-value can be defined as: the probability (under $H_0$) of observing a value of the test statistic the same as or more extreme than what was actually observed. Informally, the p-value is a measure of the evidence against $H_0$: the smaller the p-value, the stronger the evidence against $H_0$. If the p-value is low (lower than the significance level), we say that it would be very unlikely to observe the data if the null hypothesis were true, and hence we reject $H_0$. Otherwise we do not reject $H_0$; in this case the result of sampling is perhaps due to chance or sampling variability alone.

How to calculate p-value

Hypothesis Testing often contains these steps:

  1. Set the hypothesis
  2. Calculate the point estimate from a sample
  3. Check the conditions (CLT conditions) if using CLT based tests
  4. Draw the sampling distribution, shade the p-value region, and calculate the test statistic (e.g., for a mean, $W = \frac{\bar X -\mu}{\hat s/\sqrt n}$)
  5. Make a decision: based on the calculated p-value, either reject or retain the null.
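Steps 2 through 5 for a one-sample two-sided $z$-test, with hypothetical summary numbers:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Hypothetical sample summary: H0: mu = 100 vs H1: mu != 100
x_bar, s_hat, n, mu0 = 104.0, 15.0, 64, 100.0

w = (x_bar - mu0) / (s_hat / sqrt(n))   # test statistic
p_value = 2 * norm_cdf(-abs(w))         # two-sided p-value

print(f"W = {w:.2f}, p-value = {p_value:.4f}")
```

Here the p-value is about 0.03, so at $\alpha = 0.05$ we would reject the null.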

Choosing $α$:

Level of Confidence for two-sided test is 1α1-\alpha but it is 12α1-2\alpha for one sided test.

$t$-distribution:

When the sample size is large and the data are not too skewed, the sampling distribution is nearly normal and the standard error $\frac{s}{\sqrt n}$ is accurate. If not, we address the uncertainty of the standard error estimate by using the $t$-distribution. Especially when $\sigma$ is not known, it is better to use the $t$-distribution. For the $t$-distribution, observations are slightly more likely to fall beyond 2 SDs from the mean because it has thicker tails than the normal distribution. As the degrees of freedom increase, the $t$-distribution becomes more like the normal.

For example, for estimating the mean using the $t$-distribution, we use
Xˉ±tdfs^n\bar X \pm t^*_{df} \frac{\hat s}{\sqrt n}

where $df = n-1$ for a one-sample mean test and $\hat s$ is the sample standard deviation. For inference for the comparison of two independent means, we use

Xˉ1Xˉ2±tdfs12n1+s22n2\bar X_1 - \bar X_2 \pm t^*_{df} \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}

where $df = \min(n_1-1, n_2-1)$ and $s_1^2$ and $s_2^2$ are the sample variances.
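A sketch of the two-sample interval using SciPy's $t$ quantile and the conservative $df = \min(n_1-1, n_2-1)$ rule above (the group summaries are invented):

```python
import numpy as np
from scipy import stats

# Hypothetical group summaries: means, sample SDs, and sizes
x1, s1, n1 = 52.1, 4.5, 12
x2, s2, n2 = 48.3, 5.2, 15

df = min(n1 - 1, n2 - 1)            # conservative df rule from the text
t_star = stats.t.ppf(0.975, df)     # quantile for a 95% interval
se = np.sqrt(s1**2 / n1 + s2**2 / n2)

lo = (x1 - x2) - t_star * se
hi = (x1 - x2) + t_star * se
print(f"95% CI for mu1 - mu2: ({lo:.2f}, {hi:.2f})")
```

Here the interval happens to contain 0, so at the 5% level we would not conclude the means differ.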

Examples

We use the test statistic to calculate the p-value. For example, suppose you have two samples obtained in different ways with sample means $\bar X=216.2$ and $\bar Y= 195.3$ and estimated standard errors $\widehat {se}(\hat \mu_1) = 5$, $\widehat {se}(\hat \mu_2) = 2.4$. The null hypothesis is the default case, which claims they are from the same population, so the means should be equal. To test whether the means are different, we compute
$$W= \frac{\hat δ− 0}{\widehat{se}(\hat\delta)} = \frac{\bar X - \bar Y}{\sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}} = \frac{216.2− 195.3}{\sqrt{5^2 + 2.4^2}} = 3.78$$

(note that $s_1^2/m = \widehat{se}(\hat\mu_1)^2 = 5^2$ and $s_2^2/n = \widehat{se}(\hat\mu_2)^2 = 2.4^2$).

To compute the p-value, we use a $z$-test. Let $Z∼ \mathcal N (0, 1)$ and assume the conditions (CLT conditions) are met. Then,

$$\text{p-value} = p(|Z| > 3.78) = 2p(Z < - 3.78) = 2 \phi(-3.78) = .0002$$

which is very strong evidence against the null hypothesis. To test if the medians are different, let ν1\nu_1 and ν2\nu_2 denote the sample medians. Then,

W=ν1ν2se^=212.51947.7=2.4W= \frac{\nu_1 - \nu_2}{\widehat {se}} = \frac{212.5− 194}{7.7} = 2.4

where the standard error se^=7.7\widehat {se} = 7.7 of ν1ν2\nu_1 - \nu_2 was found using the bootstrap. The p-value is

p-value=P(Z>2.4)=2P(Z<2.4)=.02\text{p-value} = P(|Z| > 2.4) = 2P(Z <−2.4) = .02

which is strong evidence against the null hypothesis. In the above examples, we have relied on CLT-based tests (e.g., the $t$-test and $z$-test). The Central Limit Theorem implies that the distribution of the sample mean (or another suitable statistic) is nearly normal, centred at the population mean, with a standard deviation equal to the population standard deviation divided by the square root of the sample size. The distribution of the sample statistic approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided some conditions are met:

The χ2χ^2 Distribution

Let Z1,...,ZkZ_1, . . . , Z_k be independent, standard Normals. Let
$$V=\sum_{i=1}^k Z_i^2$$

Then we say that $V$ has a $χ^2$ distribution with $k$ degrees of freedom, written $V∼ χ_k^2$. It can be shown that $\mathbb E(V) = k$ and $\mathbb V(V) = 2k$. Pearson's $χ^2$ test is used for multinomial data. Recall that if $X= (X_1, . . . , X_k)$ has a Multinomial$(n, p)$ distribution, then the MLE of $p$ is $\hat p= (\hat p_1, . . . , \hat p_k) = (X_1/n, . . . , X_k /n)$. Let $p_0 = (p_{01}, . . . , p_{0k})$ be some fixed vector and suppose we want to test
H0:p=p0    versus    H1:pp0.H_0 : p= p_0 \;\; \text{versus}\;\; H_1 : p \ne p_0.

Pearson’s χ2χ^2 statistic is
T=j=1k(Xjnp0j)2np0j=j=1k(XjEj)2EjT = \sum_{j=1}^k \frac{(X_j - np_{0j})^2}{np_{0j}} = \sum_{j=1}^k \frac{(X_j - E_j)^2}{E_j}

where $\mathbb E(X_j) = E_j = np_{0j}$ is the expected value of $X_j$ under $H_0$. It can be shown that under $H_0$, $T\rightsquigarrow \chi^2_{k-1}$ (given $k−1$ of the $X_j$'s and $n$, the remaining one is determined, which is why there are $k-1$ degrees of freedom). Hence, the test that rejects $H_0$ when $T > \chi^2_{k-1, \alpha}$ has level $\alpha$ (that is, the probability of rejecting a true $H_0$ is $\alpha$). The p-value is $p(\chi^2_{k-1} > t)$ where $t$ is the observed value of the test statistic.

Example (Mendel's peas). Mendel bred peas with round yellow seeds and wrinkled green seeds. There are four types of progeny: round yellow, wrinkled yellow, round green, and wrinkled green. The number of each type is multinomial with probability $p= (p_1, p_2, p_3, p_4)$. His theory of inheritance predicts that $p$ is equal to

p0(916,316,316,116)p_0 ≡ \Big( \frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16}\Big)

In n=556n = 556 trials he observed X=(315,101,108,32)X= (315, 101, 108, 32). We will test H0:p=p0H_0 : p= p_0 versus H1:pp0H_1 : p \ne p_0. Since, np01=312.75np_{01} = 312.75, np02=np03=104.25np_{02} = np_{03} = 104.25, and np04=34.75np_{04} = 34.75, the test statistic is

$$\begin{align*} \chi^2 = \frac{(315 - 312.75)^2}{312.75} &+ \frac{(101 - 104.25)^2}{104.25} \\ &+ \frac{(108 - 104.25)^2}{104.25} \\& + \frac{(32 - 34.75)^2}{34.75} = 0.47 \end{align*}$$

The $α = .05$ critical value for a $\chi^2_3$ distribution is 7.815. Since 0.47 is not larger than 7.815, we do not reject the null. The p-value is
p-value=P(χ32>.47)=.93\text{p-value} = P(χ^2_3 > .47) = .93

which is not evidence against $H_0$. Hence, the data do not contradict Mendel's theory. This is how the chi-squared ($χ^2$) test is used as a non-parametric statistical test to evaluate whether observed categorical data differ significantly from what we would expect under some assumption; this is called a goodness-of-fit test. A simple example: we throw a die 60 times and expect to see each of the faces 1, ..., 6 about 10 times; we then look at the sample to test whether our assumption is supported by the data.
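Mendel's test can be reproduced with `scipy.stats.chisquare`:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([315, 101, 108, 32])
p0 = np.array([9, 3, 3, 1]) / 16      # probabilities predicted by the theory
expected = 556 * p0                   # expected counts under H0

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p-value = {p_value:.2f}")
```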

Independence Testing

Another use of χ2\chi^2 testing is the independence testing: Test whether two categorical variables are statistically independent (no relationship).

Example:
You survey 100 people about their gender and preferred pet, and organize it into a contingency table:

|  | Cat | Dog | Total |
| --- | --- | --- | --- |
| Male | 20 | 30 | 50 |
| Female | 10 | 40 | 50 |
| Total | 30 | 70 | 100 |

You can use a chi-squared test to see if pet preference is independent of gender.

Our test statistic is

i,j(OijEij)2Eij\sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

with degrees of freedom $df = (r-1)(c-1)$ where $r$ is the number of rows and $c$ is the number of columns. Calculate the $E_{ij}$'s:

|  | Cat | Dog | Total |
| --- | --- | --- | --- |
| Male | $\frac{50\times 30}{100} = 15$ | $\frac{50\times 70}{100} = 35$ | 50 |
| Female | $\frac{50\times 30}{100} = 15$ | $\frac{50\times 70}{100} = 35$ | 50 |
| Total | 30 | 70 | 100 |

and find the test statistic:

| Cell | O | E | $(O-E)^2/E$ |
| --- | --- | --- | --- |
| Male–Cat | 20 | 15 | $\frac{(20-15)^2}{15} = \frac{25}{15} \approx 1.667$ |
| Male–Dog | 30 | 35 | $\frac{(30-35)^2}{35} = \frac{25}{35} \approx 0.714$ |
| Female–Cat | 10 | 15 | $\frac{(10-15)^2}{15} = \frac{25}{15} \approx 1.667$ |
| Female–Dog | 40 | 35 | $\frac{(40-35)^2}{35} = \frac{25}{35} \approx 0.714$ |

So

$$\chi^2 = 1.667+0.714+1.667+0.714 \approx 4.76$$

Using a $χ^2$ table or calculator: at $α = 0.05$ and $df = 1$, the critical value is 3.84. Since $4.76 > 3.84$, we reject the null hypothesis at the 5% level. There is evidence that gender and pet preference are not independent.
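The same test via `scipy.stats.chi2_contingency` (with `correction=False` to match the plain statistic; tiny differences from the hand calculation are rounding):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 30],
                  [10, 40]])

stat, p_value, df, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {stat:.3f}, df = {df}, p-value = {p_value:.3f}")
print(expected)  # the E_ij table computed from the margins
```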

ANOVA (Analysis of Variance):

ANOVA is a statistical method used to compare means across multiple groups. It tells you whether at least one group mean is different — but not which one. Think of it as a generalization of the t-test, which only compares two groups.

ANOVA can be used when:

For example, suppose you’re testing if three fertilizers (A, B, C) lead to different average plant growths. You measure growth in cm for each group:

You want to test:
H0:μA=μB=μCH1:At least one μi differs\begin{align*} H_0&: μ_A = μ_B = μ_C \\ H_1&: \text{At least one $\mu_i$ differs} \end{align*}

ANOVA splits the total variability in the data into two parts:

If the between-group variance is large compared to the within-group variance, the group means likely differ.

ANOVA test statistic is the FF ratio:

F=Variability between groupsVariability within groups=MSGMSWF = \frac{\text{Variability between groups}}{\text{Variability within groups}} = \frac{MSG}{MSW}

where
MSG=SSGdfGMSW=SSWdfWSSG=i=1kni(yˉiyˉ)2SSW=i=1kj=1ni(yijyˉi)2dfG=k1dfW=Nk\begin{align*} MSG &= \frac{SSG}{df_G} \\ MSW &= \frac{SSW}{df_W}\\ SSG &= \sum_{i=1}^k n_i(\bar y_i - \bar y)^2\\ SSW &= \sum_{i=1}^k\sum_{j=1}^{n_i} (y_{ij} - \bar y_i)^2 \\ df_G & = k-1 \\ df_W &= N-k \end{align*}

Compare $F$ to the critical value from the $F$-distribution with $(k−1, N−k)$ degrees of freedom, or use the p-value: if $p < α$ (e.g., 0.05), reject $H_0$. Again, note that this analysis assumes non-paired groups, i.e., groups that are independent and approximately normal with roughly equal variances.
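A sketch of the fertilizer example with invented measurements, checking `scipy.stats.f_oneway` against the MSG/MSW formulas above:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical growth measurements (cm) for fertilizers A, B, C
a = np.array([20.1, 21.3, 19.8, 22.0, 20.5])
b = np.array([22.4, 23.1, 21.9, 24.0, 22.8])
c = np.array([19.5, 20.0, 18.7, 19.9, 20.3])

f_stat, p_value = f_oneway(a, b, c)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")

# The same F ratio computed from the definitions above
groups = [a, b, c]
grand = np.concatenate(groups).mean()
ssg = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # between
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within
f_manual = (ssg / 2) / (ssw / 12)    # df_G = k-1 = 2, df_W = N-k = 12
print(f"manual F = {f_manual:.2f}")
```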

Bootstrapping for hypothesis testing

Bootstrapping is a powerful, non-parametric statistical technique used for:

A very powerful substitute for CLT-based tests is bootstrapping: for example, finding a confidence interval for the median. We resample the data, with replacement, to the size of the original data. This way we can obtain many samples of the median and get an idea of its distribution (as a histogram, for instance). For an approximate 90% confidence interval, we find the 5th and 95th percentiles; the desired interval lies between these two numbers. This is called the percentile method.

Using bootstrapping for hypothesis testing is similar. Suppose you have two groups. You want to test whether the means are significantly different.

Follow the steps:
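One common concrete recipe is to pool the two groups (enforcing $H_0$), resample both groups from the pool, and see how extreme the observed difference is; everything below (the data and this particular pooling approach) is an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
group1 = rng.normal(10.0, 2.0, size=40)   # hypothetical measurements
group2 = rng.normal(14.0, 2.0, size=35)

observed = group1.mean() - group2.mean()

# Pool the groups: under H0 both samples come from the same population
pooled = np.concatenate([group1, group2])

m = 10_000
diffs = np.empty(m)
for i in range(m):
    b1 = rng.choice(pooled, size=len(group1), replace=True)
    b2 = rng.choice(pooled, size=len(group2), replace=True)
    diffs[i] = b1.mean() - b2.mean()

# Fraction of resampled differences at least as extreme as the observed one
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed diff = {observed:.2f}, bootstrap p-value = {p_value:.4f}")
```

A small p-value means the observed difference is far out in the tail of the null distribution of differences.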

Key Advantages of Bootstrapping

Decision Theory

Suppose we have an input vector xx together with a corresponding vector tt of target variables, and our goal is to predict tt given a new value for xx. For regression problems, tt will comprise continuous variables, whereas for classification problems tt will represent class labels. The joint probability distribution p(x,t)p(x, t) provides a complete summary of the uncertainty associated with these variables. Determination of p(x,t)p(x, t ) from a set of training data is an example of inference and is typically a very difficult problem whose solution forms the subject of much of this book. In a practical application, however, we must often make a specific prediction for the value of tt, or more generally take a specific action based on our understanding of the values tt is likely to take, and this aspect is the subject of decision theory.

Minimizing the Expected Loss for Classification

For many applications, our objective will be more complex than simply minimizing the number of misclassifications. That is why we introduce a loss function, also called a cost function, which is a single, overall measure of loss incurred in taking any of the available decisions or actions. Our goal is then to minimize the total loss incurred. Suppose that, for a new value of x\bm x, the true class is CkC_k and that we assign x\bm x to class CjC_j (where jj may or may not be equal to kk). In so doing, we incur some level of loss that we denote by LkjL_{kj}, which we can view as the k,jk, j element of a loss matrix. For a given input vector x\bm x, our uncertainty in the true class is expressed through the joint probability distribution p(x,Ck)p(\bm x, C_k) and so we seek instead to minimize the average loss, where the average is computed with respect to this distribution, which is given by

$$\begin{align*} \mathbb E[L] &= \sum_k \sum_j \int_{R_j} L_{kj} \; p(\bm x, C_k) d\bm x\\ &= \sum_j \int_{R_j} \sum_k L_{kj} \; p(\bm x, C_k) d\bm x \end{align*}$$

Each $\bm x$ can be assigned independently to one of the decision regions $R_j$. Our goal is to choose the regions $R_j$ in order to minimize the expected loss, which implies that for each $\bm x$, we should minimize $\sum_k L_{kj} \; p(\bm x, C_k)$. As before, we can use the product rule $p(\bm x, C_k) = p(C_k \mid \bm x)p(\bm x)$ to eliminate the common factor of $p(\bm x)$. Thus the decision rule that minimizes the expected loss is the one that assigns each new $\bm x$ to the class $j$ for which the quantity
kLkjp(Ckx)\sum_k L_{kj}p(C_k\mid \bm x)

is a minimum. This is clearly trivial to do, once we know the posterior class probabilities p(Ckx)p(C_k\mid \bm x).
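A minimal sketch with a hypothetical $2\times 2$ loss matrix in which misclassifying true class 1 costs ten times as much as misclassifying true class 0:

```python
import numpy as np

# Loss matrix L[k, j]: cost of predicting class j when the true class is k.
L = np.array([[0.0, 1.0],
              [10.0, 0.0]])

def decide(posterior, L):
    """Pick the class j minimizing sum_k L[k, j] * p(C_k | x)."""
    expected_loss = posterior @ L   # entry j equals sum_k p_k * L[k, j]
    return int(np.argmin(expected_loss))

print(decide(np.array([0.7, 0.3]), L))
print(decide(np.array([0.95, 0.05]), L))
```

With posteriors $(0.7, 0.3)$ the rule still picks class 1, because the expected loss $0.3 \times 10 = 3$ of choosing class 0 outweighs $0.7 \times 1 = 0.7$.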

Rejection Option

Classification errors arise from the regions of input space where the largest of the posterior probabilities p(Ckx)p(C_k\mid \bm x) is significantly less than unity, or equivalently where the joint distributions p(x,Ck)p(\bm x, C_k) have comparable values. These are the regions where we are relatively uncertain about class membership. In some applications, it will be appropriate to avoid making decisions on the difficult cases in anticipation of a lower error rate on those examples for which a classification decision is made. This is known as the reject option. We can achieve this by introducing a threshold θθ and rejecting those inputs x\bm x for which the largest of the posterior probabilities p(Ckx)p(C_k\mid \bm x) is less than or equal to θθ. Note that setting θ=1θ = 1 will ensure that all examples are rejected, whereas if there are KK classes then setting θ<1/Kθ < 1/K will ensure that no examples are rejected. Thus the fraction of examples that get rejected is controlled by the value of θθ.


Minimizing the Expected Loss for Regression

So far, we have discussed decision theory in the context of classification problems. We now turn to the case of regression problems, such as the curve fitting example discussed earlier. The decision stage consists of choosing a specific estimate y(x)y(\bm x) of the value of tt for each input x\bm x. Suppose that in doing so, we incur a loss L(t,y(x))L(t, y(\bm x)). The average, or expected, loss is then given by

E[L]=L(t,y(x))p(x,t)dxdt\mathbb E[L] = \int\int L(t, y(\bm x)) p(\bm x, t) d\bm xdt

A common choice of loss function in regression problems is the squared loss given by L(t,y(x))=(y(x)t)2L(t, y(\bm x)) =\big( {y(\bm x)− t}\big) ^2:

$$\mathbb E[L] = \int\int \big(y(\bm x)− t\big)^2\, p(\bm x, t)\, d\bm x\, dt$$

Our goal is to choose y(x)y(\bm x) so as to minimize E[L]\mathbb E[L]. It turns out that the optimal answer to this problem is y(x)=E[tx]y(\bm x)= \mathbb E[ t|\bm x]. To see this, we can expand the square term as follows

{y(x)t}2={y(x)E[tx]+E[tx]t}2={y(x)E[tx]}2+2{y(x)E[tx]}{E[tx]t}+{E[tx]t}2\begin{align*} \{y(\bm x)− t\}^2 &= \{y(\bm x)− \mathbb E[ t\mid \bm x] + \mathbb E[ t\mid \bm x] - t \}^2 \\ &= \{y(\bm x) − \mathbb E[ t\mid \bm x] \}^2 \\ & \qquad + 2 \{y(\bm x) - \mathbb E[ t\mid \bm x] \} \{\mathbb E[ t\mid \bm x] - t \} \\ & \qquad + \{\mathbb E[ t\mid \bm x] - t \}^2 \end{align*}

Substituting into the loss function and performing the integral over tt, we see that the cross-term vanishes and we obtain an expression for the loss function in the form

$$\mathbb E[L] = \int \{y(\bm x) − \mathbb E[ t\mid\bm x] \}^2 p(\bm x)d\bm x + \int \{\mathbb E[ t\mid \bm x] - t \}^2 p(\bm x, t)d\bm x dt$$

The function $y(\bm x)$ we seek to determine enters only in the first term, which will be minimized when $y(\bm x)= \mathbb E[ t\mid\bm x]$, in which case this term will vanish. This is simply the result that we derived previously: the optimal least-squares predictor is given by the conditional mean. The estimator $y(\bm x)= \mathbb E_t[ t\mid\bm x]$ is the best we can ever hope to do with any learning algorithm. The second term (called the Bayes error) is the variance of the distribution of $t$, averaged over $\bm x$:
Var(tx)p(x)dx\int \text{Var}(t \mid \bm x) p(\bm x)d\bm x

It represents the intrinsic variability of the target data and can be regarded as noise. Because it is independent of y(x)y(\bm x), it represents the irreducible minimum value of the loss function.

Entropy

Consider a discrete random variable $\bm x$, and ask how much information is received when we observe a specific value of this variable. The amount of information can be viewed as the 'degree of surprise' on learning the value of $\bm x$. If we are told that a highly improbable event has just occurred, we receive more information than if we were told that some very likely event has just occurred, and if we knew that the event was certain to happen we would receive no information. Our measure of information content will therefore depend on the probability distribution $p(x)$, and we therefore look for a quantity $h(x)$ that is a monotonic function of the probability $p(x)$ and that expresses the information content. Note that if we have two events $x$ and $y$ that are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately, so that $h(x, y) = h(x) + h(y)$. Two unrelated events will be statistically independent and so $p(x, y) = p(x)p(y)$. From these two relationships, it is easily shown that $h(x)$ must be given by the logarithm of $p(x)$: $h(x) = -\log p(x)$. Note that low-probability events $x$ correspond to high information content. The average amount of information is obtained by taking the expectation with respect to the distribution $p(x)$:

H[x]=xp(x)logp(x).H[x] =− \sum_x p(x) \log p(x).

This important quantity is called the entropy of the random variable xx. Note that limp0plnp=0\lim_{p→0} p \ln p = 0 and so we shall take p(x)lnp(x)=0p(x) \ln p(x) = 0 whenever we encounter a value of xx such that p(x)=0p(x) = 0. Distributions p(x)p(x) that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy. For example, when one of the p(xi)p(x_i) is 1 and the rest are zero, the entropy is at its minimum value 0. But if p(x1)==p(xn)=1/np(x_1)=\dots=p(x_n)=1/n (all equal), the entropy is at its maximum value logn\log n. The definition of entropy for continuous variables is similar:

H[x]=p(x)logp(x)dx.H[x] = -\int p(x) \log p(x) dx.

The cross entropy between two probability distributions pp and qq is defined as H(p,q)=xp(x)logq(x)H(p, q) = − \sum_x p(x) \log q(x).
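
As a quick numerical sanity check of these definitions, here is a minimal NumPy sketch (the helper names are our own, and we use the natural log) computing entropy and cross entropy for discrete distributions, with the convention 0 log 0 = 0:

```python
import numpy as np

def entropy(p):
    """H[x] = -sum_x p(x) log p(x), taking 0 log 0 = 0 (natural log)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # skip zero-probability outcomes
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

peaked = [1.0, 0.0, 0.0, 0.0]        # sharply peaked: minimum entropy 0
uniform = [0.25, 0.25, 0.25, 0.25]   # uniform over n = 4 values: entropy log 4
```

Note that the cross entropy of a distribution with itself reduces to the entropy, H(p, p) = H[p].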

Linear Models for Regression

The simplest form of linear regression models are also linear functions of the input variables. However, we can obtain a much more useful class of functions by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions. Such models are linear functions of the parameters, which gives them simple analytical properties, and yet can be nonlinear with respect to the input variables. The simplest linear model for regression is one that involves a linear combination of the input variables

y(x,w)=w0+w1x1+...+wDxDy(\bm x,\bm w) = w_0 + w_1x_1 +. . . + w_Dx_D

where x=(x1,...,xD)T\bm x = (x_1, . . . , x_D)^T. This is often simply known as linear regression. The key property of this model is that it is a linear function of the parameters w0,...,wDw_0, . . . , w_D. It is also, however, a linear function of the input variables xix_i, and this imposes significant limitations on the model. We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

y(x,w)=w0+j=1M1wjϕj(x)y(\bm x,\bm w) = w_0 +\sum_{j=1}^{M−1} w_j \phi_j(\bm x)

where ϕj(x):RnR\phi_j(\bm x): \mathbb R^n \rightarrow \mathbb R are known as basis functions. By denoting the maximum value of the index jj by M1M− 1, the total number of parameters in this model will be MM. The parameter w0w_0 allows for any fixed offset in the data and is sometimes called a bias parameter (not to be confused with ‘bias’ in a statistical sense). It is often convenient to define an additional dummy ‘basis function’ φ0(x)=1φ_0(x) = 1 so that

y(x,w)=j=0M1wjϕj(x)=wTϕ(x)y(\bm x,\bm w) = \sum_{j=0}^{M−1} w_j \phi_j(\bm x) = \bm w^T \bm \phi(\bm x)

where w=(w0,...,wM1)T\bm w = (w_0, . . . , w_{M−1})^T and ϕ=(ϕ0,...,ϕM1)T\bm \phi = (\phi_0, . . . , \phi_{M−1})^T. The example of polynomial regression mentioned before is a particular example of this model in which there is a single input variable xx, and the basis functions take the form of powers of xx so that ϕj(x)=xj\phi_j(x) = x^j. One limitation of polynomial basis functions is that they are global functions of the input variable, so that changes in one region of input space affect all other regions. There are many other possible choices for the basis functions, for example

ϕj(x)=exp{xμj22s2}\phi_j(\bm x) = \exp\{-\frac{||\bm x-\bm \mu_j||^2}{2s^2} \}

where the μj\bm \mu_j govern the locations of the basis functions in input space, and the parameter ss governs their spatial scale. These are usually referred to as Gaussian basis functions, although it should be noted that they are not required to have a probabilistic interpretation, and in particular the normalization coefficient is unimportant because these basis functions will be multiplied by adaptive parameters wjw_j. Another possibility is the sigmoidal basis function of the form

ϕj(x)=σ(xμjs)\phi_j(x) = \sigma \Big (\frac{x-\mu_j}{s} \Big)

where σ(a)σ(a) is the sigmoid function. Yet another possible choice of basis function is the Fourier basis, which leads to an expansion in sinusoidal functions. Each basis function represents a specific frequency and has infinite spatial extent. Most of the discussion in this chapter, however, is independent of the particular choice of basis function set, and so for most of our discussion we shall not specify the particular form of the basis functions; the analysis applies equally to the identity choice ϕ(x)=x\bm \phi(\bm x) =\bm x.
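
To make the basis-function construction concrete, here is a small sketch (the helper name `gaussian_basis`, the grid of centers, and the scale s = 0.1 are illustrative assumptions) that builds the design matrix from Gaussian basis functions plus the dummy basis φ₀(x) = 1:

```python
import numpy as np

def gaussian_basis(X, centers, s):
    """Design matrix with phi_0 = 1 and Gaussian bumps at the given centers.

    X:       (N,) input values
    centers: (M-1,) basis-function locations mu_j
    s:       spatial scale
    Returns an (N, M) matrix Phi with Phi[n, j] = phi_j(x_n).
    """
    X = np.asarray(X, dtype=float)[:, None]
    mu = np.asarray(centers, dtype=float)[None, :]
    bumps = np.exp(-(X - mu) ** 2 / (2 * s ** 2))   # (N, M-1) via broadcasting
    ones = np.ones((X.shape[0], 1))                 # dummy basis phi_0(x) = 1
    return np.hstack([ones, bumps])

# 25 inputs on [0, 1], nine Gaussian bumps: a 25 x 10 design matrix.
Phi = gaussian_basis(np.linspace(0, 1, 25), centers=np.linspace(0, 1, 9), s=0.1)
```

The unnormalized exponential is enough here: as noted above, any normalization coefficient would be absorbed into the adaptive parameters w_j.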

Maximum Likelihood and Least Squares

As before, we assume that the target variable tt is given by a deterministic function y(x,w)y(\bm x,\bm w) with additive Gaussian noise so that

t=y(x,w)+ϵ,t = y(\bm x,\bm w) + \epsilon,

where ϵ\epsilon is a zero mean Gaussian random variable with precision (inverse variance) ββ. Thus we can write

p(tx,w,β)=N(ty(x,w),β1).p(t\mid \bm x, \bm w, β) = \mathcal N (t\mid y(\bm x,\bm w), β^{−1}).

Here we have defined a precision parameter β corresponding to the inverse variance of the distribution β1=σ2\beta^{-1}=\sigma^2.


Recall that, if we assume a squared loss function, then the optimal prediction, for a new value of x\bm x, will be given by the conditional mean of the target variable. In the case of a Gaussian conditional distribution of the form, the conditional mean will be simply

E[tx]=tp(tx)dt=y(x,w).\mathbb E[t\mid \bm x] = \int tp(t\mid \bm x) dt= y(\bm x, \bm w).

Note that the Gaussian noise assumption implies that the conditional distribution of tt given x\bm x is unimodal, which may be inappropriate for some applications. An extension to mixtures of conditional Gaussian distributions, which permit multimodal conditional distributions, will be discussed later.

Now consider a dataset of inputs X={x1,...,xN}\bm X= \{ \bm x_1, . . . ,\bm x_N \} with corresponding target values t=(t1,...,tN)\bm t = (t_1, . . . , t_N) . Making the assumption that these data points are drawn independently from the distribution (equivalently, ϵiϵ_i are distributed IID), we obtain the following expression for the likelihood function, which is a function of the adjustable parameters w\bm w and ββ, in the form

p(tX,w,β)=n=1NN(tnwTϕ(xn),β1)p(\bm t \mid \bm X, \bm w, \beta) = \prod_{n=1}^N \mathcal N(t_n \mid \bm w^T \bm \phi(\bm x_n), \beta^{-1})

Note that in supervised learning problems such as regression and classification, we are not seeking to model the distribution of the input variables. Thus x\bm x will always appear in the set of conditioning variables, and so from now on we will drop the explicit x\bm x from expressions to keep the notation uncluttered. Taking the logarithm of the likelihood function, and making use of the standard form for the univariate Gaussian, we have

lnp(tw,β)=n=1NlnN(tnwTϕ(xn),β1)=N2lnβN2ln2πβED(w)\begin{align*} \ln p(\bm t \mid \bm w, \beta) &= \sum_{n=1}^N \ln \mathcal N(t_n \mid \bm w^T \bm \phi(\bm x_n), \beta^{-1}) \\ & = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(\bm w) \end{align*}

where

ED(w)=12n=1N{tnwTϕ(xn)}2=12(tΦw)T(tΦw)E_D(\bm w) = \frac{1}{2} \sum_{n=1}^N \{ t_n - \bm w^T \bm \phi(\bm x_n)\}^2 = \frac{1}{2} (\bm t - \bm \Phi \bm w)^T(\bm t - \bm \Phi \bm w)

Note: the terms “probability” and “likelihood” have different meanings in statistics: given a statistical model with some parameters θθ, the word “probability” is used to describe how plausible a future outcome xx is (knowing the parameter values θθ), while the word “likelihood” is used to describe how plausible a particular set of parameter values θθ are, after the outcome xx is known.

Having written down the likelihood function, we can use maximum likelihood to determine w\bm w and ββ. Consider first the maximization with respect to w\bm w. We see that maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing a sum-of-squares error function given by ED(w)E_D(\bm w). The gradient of the log likelihood function takes the form

wlnp(tw,β)=βn=1Nϕ(xn){tnwTϕ(xn)}=βΦT(tΦw),w2lnp(tw,β)=βΦTΦ.\begin{align*} \nabla_{\bm w} \ln p(\bm t \mid \bm w, \beta) &= \beta \sum_{n=1}^N \bm \phi(\bm x_n) \{ t_n - \bm w^T \bm \phi(\bm x_n)\} = \beta \bm \Phi^T(\bm t - \bm \Phi \bm w), \\ \nabla^2_{\bm w} \ln p(\bm t \mid \bm w, \beta) &= -\beta \bm \Phi^T \bm \Phi. \end{align*}

Setting this gradient to zero and solving for w\bm w, we obtain:

wML=(ΦTΦ)1ΦTt\begin{align*} \bm w_{ML} = (\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T \bm t \end{align*}

which is known as the Normal Equation for the least squares problem. Here ΦΦ is an N×M matrix, called the design matrix, whose elements are given by Φnj=φj(xn)Φ_{nj}= φ_j(x_n), so that

Φ=(ϕ0(x1)ϕ1(x1)ϕM1(x1)ϕ0(x2)ϕ1(x2)ϕM1(x2)ϕ0(xN)ϕ1(xN)ϕM1(xN))\Phi = \begin{pmatrix} \phi_0(\bm x_1) & \phi_1(\bm x_1) &\dots& \phi_{M-1}(\bm x_1) \\ \phi_0(\bm x_2) & \phi_1(\bm x_2) &\dots& \phi_{M-1}(\bm x_2) \\ \vdots & \vdots & \ddots & \vdots\\ \phi_0(\bm x_N) & \phi_1(\bm x_N) &\dots& \phi_{M-1}(\bm x_N) \end{pmatrix}

The quantity Φ=(ΦTΦ)1ΦT\bm \Phi^{\dagger} = (\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T is known as the Moore-Penrose pseudo-inverse of the matrix. It can be regarded as a generalization of the notion of matrix inverse to nonsquare matrices. We can also maximize the log likelihood function with respect to the noise precision parameter ββ, giving

1βML=1Nn=1N(y(xn,wML)tn)2=1Nn=1N{tnwMLTϕ(xn)}2\begin{align*} \frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^N \big( y(x_n, \bm w_{ML}) - t_n\big)^2 = \frac{1}{N}\sum_{n=1}^N \{ t_n - \bm w^T_{ML} \bm \phi(\bm x_n)\}^2 \end{align*}

and so we see that the inverse of the noise precision, that is 1Nn=1N{tnwMLTϕ(xn)}2\frac{1}{N}\sum_{n=1}^N \{ t_n - \bm w^T_{ML} \bm \phi(\bm x_n)\}^2, is given by the residual variance of the target values around the regression function. The geometrical interpretation of the least-squares solution is the following: the least-squares regression function is obtained by finding the orthogonal projection y=ΦwML\bm y = \bm \Phi \bm w_{ML} of the data vector t=(t1,...,tN)\bm t=(t_1, . . . , t_N) onto the lower-dimensional subspace spanned by the feature vectors φj=(ϕj(x1),,ϕj(xN))\bm φ_j = (\phi_j(\bm x_1),\dots, \phi_j(\bm x_N)). The reason for orthogonality is that this projection minimizes ty=tΦwML||\bm t - \bm y || = || \bm t - \bm \Phi \bm w_{ML} ||.
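
The maximum likelihood fit can be sketched end to end on synthetic data (the sinusoidal target, the noise level σ = 0.2, and the cubic polynomial basis are arbitrary illustrative choices). Note how the residual vector comes out orthogonal to every column of Φ, matching the projection picture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: t = sin(2*pi*x) + zero-mean Gaussian noise (sigma = 0.2).
N = 200
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# Polynomial basis phi_j(x) = x^j for j = 0..M-1 (an illustrative choice).
M = 4
Phi = np.vander(x, M, increasing=True)

# Normal Equation: w_ML = (Phi^T Phi)^{-1} Phi^T t (solved, not inverted).
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# ML noise estimate: 1/beta_ML is the mean squared residual.
residuals = t - Phi @ w_ml
beta_ml_inv = np.mean(residuals ** 2)

# Geometry: the residual vector is orthogonal to every column phi_j of Phi.
orth = Phi.T @ residuals
```

Solving the linear system directly is preferable in practice to forming the explicit inverse of Φ^TΦ.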


Having determined the parameters w\bm w and ββ, we can now make predictions for new values of xx:

p(tx,wML,βML)=N(ty(x,wML),βML1)p( t \mid x,\bm w_{ML}, \beta_{ML}) = \mathcal{N}(t \mid y(x, \bm w_{ML}), \beta^{-1}_{ML})

So far, we have considered the case of a single target variable tt. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector t\bm t. This could be done by introducing a different set of basis functions for each component of t\bm t, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that y(x,w)=WTϕ(x)y(\bm x, \bm w) = \bm W^T \bm \phi(\bm x) where W\bm W is a matrix. Everything proceeds as in the single-output case:

WML=(ΦTΦ)1ΦTT\bm W_{ML} = (\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T \bm T

If we examine this result for each target variable tkt_k, we have wk=(ΦTΦ)1ΦTtk\bm w_k = (\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T \bm t_k. Thus the solution to the regression problem decouples between the different target variables, and we need only compute a single pseudo-inverse matrix ΦΦ^†, which is shared by all of the vectors wk\bm w_k.

The extension to general Gaussian noise distributions having arbitrary covariance matrices is straightforward. This leads to a decoupling into K independent regression problems. This result is unsurprising because the parameters W\bm W define only the mean of the Gaussian noise distribution, and we know that the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance. From now on, we shall therefore consider a single target variable tt for simplicity.

Hypothesis Testing

Up to now we have made minimal assumptions about the true distribution of the data. We now assume that the observations tit_i are uncorrelated with constant precision β\beta, or constant variance 1β\frac{1}{\beta}. Recall that wML\bm w_{ML} is an unbiased estimator of w\bm w because its expectation (conditioned on XX) is the true parameter w\bm w:

E[wML]=E[(ΦTΦ)1ΦTt]=(ΦTΦ)1ΦTE[t]=(ΦTΦ)1ΦTΦw=w\begin{align*} \mathbb E[\bm w_{ML}] = \mathbb E [(\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T \bm t] = (\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T \mathbb E [\bm t] = (\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T \bm \Phi \bm w = \bm w \end{align*}

The variance–covariance matrix of the least squares parameter estimates wML\bm w_{ML} is easily derived from its defining equation:

Var(wML)=((ΦTΦ)1ΦT)Var(t)((ΦTΦ)1ΦT)T=1β(ΦTΦ)1\text{Var}(\bm w_{ML}) = \Big((\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T \Big) \text{Var}(\bm t) \Big ((\bm \Phi^T \bm \Phi)^{-1} \bm \Phi^T\Big)^T = \frac{1}{\beta}(\bm \Phi^T \bm \Phi)^{-1}

because Var(t)=1βI\text{Var}(\bm t) = \frac{1}{\beta} \bm I. Note that 1β=σ2\frac{1}{\beta} = \sigma^2. The maximum likelihood estimate of β\beta given earlier is biased. Typically, one estimates the variance 1β\frac{1}{\beta} by

σ^2=1β^=1NMn=1N{tnwMLTϕ(xn)}2\hat \sigma^2 = \frac{1}{\hat \beta} = \frac{1}{N-M}\sum_{n=1}^N \{ t_n - \bm w^T_{ML} \bm \phi(\bm x_n)\}^2

The NMN−M rather than NN in the denominator makes σ^2\hat \sigma^2 an unbiased estimate of σ2\sigma^2. Therefore, we can say: wMLN(w,1β(ΦTΦ)1)\bm w_{ML} \sim \mathcal N(\bm w, \frac{1}{\beta}(\bm \Phi^T \bm \Phi)^{-1}). Assuming the tit_i are independent, NMβ^1βχNM2\frac{N-M}{\hat \beta} \sim \frac{1}{\beta} \chi^2_{N-M}, a chi-squared distribution with NMN−M degrees of freedom. In addition, wML\bm w_{ML} and β^\hat \beta are statistically independent. We use these distributional properties to form hypothesis tests and confidence intervals for the parameters wMLj\bm w^j_{ML}. For example, to test the hypothesis that a particular coefficient wj=0w_j = 0, we form the standardized coefficient or zz-score:

zj=wMLj0σ^vjz_j = \frac{w^j_{ML} - 0}{\hat \sigma \sqrt{v_j}}

where vjv_j is the jjth diagonal element of (ΦTΦ)1(\Phi^T \Phi)^{−1}. Under the null hypothesis that wj=0w_j = 0, zjz_j is distributed as tNMt_{N−M} (a tt distribution with NMN−M degrees of freedom), and hence a large (absolute) value of zjz_j will lead to rejection of this null hypothesis. If σ^\hat σ is replaced by a known value σσ, then zjz_j would have a standard normal distribution. The difference between the tail quantiles of a tt-distribution and a standard normal becomes negligible as the sample size increases, and so we typically use the normal quantiles.
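
As an illustration, the z-scores can be computed directly with NumPy on synthetic data containing one genuinely useful feature and one pure-noise feature (all data-generating choices below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic design: bias column plus one useful and one pure-noise feature.
N = 100
X = rng.normal(size=(N, 2))
Phi = np.column_stack([np.ones(N), X])            # M = 3 columns
w_true = np.array([0.5, 2.0, 0.0])                # last coefficient truly zero
t = Phi @ w_true + rng.normal(0.0, 1.0, N)

M = Phi.shape[1]
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Unbiased noise variance estimate with N - M in the denominator.
resid = t - Phi @ w_ml
sigma2_hat = resid @ resid / (N - M)

# z_j = w_j / (sigma_hat * sqrt(v_j)), v_j the j-th diagonal of (Phi^T Phi)^{-1}.
v = np.diag(np.linalg.inv(Phi.T @ Phi))
z = w_ml / np.sqrt(sigma2_hat * v)

# |z_j| beyond the ~1.96 normal quantile rejects w_j = 0 at roughly the 5% level.
```

On data like this the useful feature gets a very large |z| while the noise feature's |z| typically stays small, so only the latter's null hypothesis survives.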

Training Models

The Normal Equation computes the inverse of XTXX^T X, which is an (n + 1) × (n + 1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n2.4)\mathcal O(n^{2.4}) to O(n3)\mathcal O(n^3) (depending on the implementation). In practice, a direct solution of the normal equations can lead to numerical difficulties when ΦTΦΦ^TΦ is close to singular. In particular, when two or more of the basis vectors ϕjϕ_j are collinear (perfectly correlated), or nearly so, the resulting parameter values can have large magnitudes or fail to be uniquely defined. However, the fitted values are still the projection of t\bm t onto the space spanned by the ϕjϕ_j; there would just be more than one way to express that projection in terms of the ϕjϕ_j. Such near degeneracies are not uncommon when dealing with real datasets. Note that the addition of a regularization term ensures that the matrix is non-singular, even in the presence of degeneracies.

The pseudoinverse itself is computed using a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix XX into the matrix multiplication of three matrices UΣVTU Σ V^T (see numpy.linalg.svd()). The pseudoinverse is computed as X+=VΣ+UTX^+ = VΣ^+U^T. To compute the matrix Σ+Σ^+, the algorithm takes ΣΣ and sets to zero all values smaller than a tiny threshold value, then it replaces all the non-zero values with their inverse, and finally it transposes the resulting matrix. This approach is more efficient than computing the Normal Equation, plus it handles edge cases nicely: indeed, the Normal Equation may not work if the matrix XTXX^TX is not invertible (i.e., singular), such as if m < n or if some features are redundant, but the pseudo-inverse is always defined. The SVD approach used by Scikit-Learn’s LinearRegression class is about O(n2)\mathcal O(n^2).
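
A minimal sketch of the SVD-based pseudoinverse just described, on a deliberately redundant design matrix so that Φ^TΦ is singular (the threshold rule below is an illustrative choice; `numpy.linalg.pinv` performs essentially the same cutoff internally):

```python
import numpy as np

rng = np.random.default_rng(2)

# A design matrix with a duplicated column, so Phi^T Phi is singular.
A = rng.normal(size=(10, 2))
Phi = np.column_stack([A, A[:, 0]])           # third column copies the first
t = rng.normal(size=10)

# Pseudoinverse via SVD: Phi = U S V^T, Phi^+ = V S^+ U^T, where S^+ inverts
# the non-negligible singular values and sets the tiny ones to zero.
U, S, Vt = np.linalg.svd(Phi, full_matrices=False)
tol = S.max() * max(Phi.shape) * np.finfo(float).eps
S_inv = np.zeros_like(S)
S_inv[S > tol] = 1.0 / S[S > tol]
Phi_pinv = Vt.T @ np.diag(S_inv) @ U.T

w = Phi_pinv @ t                              # well-defined despite singularity

# The fitted values are still the projection of t onto the column space of Phi.
fitted = Phi @ w
```

Even though the normal equations have no unique solution here, the pseudoinverse picks a particular least-squares solution, and the residual t − Φw remains orthogonal to the column space.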

Gradient Descent

In machine learning the more common way to optimize the objective function is to use iterative algorithms such as Gradient descent. Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function. An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

w(τ+1)w(τ)αwED(w)\bm w^{(\tau+1)} \leftarrow \bm w^{(\tau)} - \alpha \nabla_{\bm w} E_D(\bm w)

The learning rate typically takes small values, e.g. 0.01 or 0.0001. On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge. With gradient descent, we never actually reach the optimum, but merely approach it gradually. Why, then, would we ever prefer gradient descent? Two reasons:

  1. We can only solve the system of equations in closed-form like Normal Equations for a handful of models. By contrast, we can apply gradient descent to any model for which we can compute the gradient. Importantly, this can usually be done automatically, so software packages like Theano and TensorFlow can save us from ever having to compute partial derivatives by hand.

  2. Solving a large system of linear equations can be expensive (matrix inversion is an O(D3)\mathcal O(D^3) algorithm), possibly many orders of magnitude more expensive than a single gradient descent update. Therefore, gradient descent can sometimes find a reasonable solution much faster than solving the linear system. Therefore, gradient descent is often more practical than computing exact solutions, even for models where we are able to derive the latter.

To implement algorithms in Python, we vectorize algorithms by expressing them in terms of vectors and matrices (using Numpy or deep learning libraries, for example). This way, the equations, and the code, will be simpler and more readable. Also we get rid of dummy variables/indices! Vectorized code is much faster. It cuts down on Python interpreter overhead. It uses highly optimized linear algebra libraries, fast matrix multiplication on a Graphics Processing Unit (GPU).

To implement Gradient Descent, compute the gradient of the cost function with regards to each model parameter. This could involve calculations over the full training set XX, at each Gradient Descent step! This algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets. However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation or SVD decomposition.
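
A minimal, fully vectorized Batch Gradient Descent sketch on synthetic linear data (the learning rate, iteration count, and data-generating parameters are illustrative assumptions); it should approach the Normal Equation solution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear data t = 4 + 3x + noise, with a bias column in the design matrix.
N = 100
x = rng.uniform(0, 2, N)
Phi = np.column_stack([np.ones(N), x])
t = 4 + 3 * x + rng.normal(0, 0.5, N)

alpha = 0.1                                   # learning rate (illustrative)
w = np.zeros(2)
for _ in range(2000):
    # Gradient of E_D(w) = 1/2 ||t - Phi w||^2 is -Phi^T (t - Phi w);
    # the full batch is used at every step, averaged over N.
    grad = -Phi.T @ (t - Phi @ w)
    w -= alpha * grad / N

w_exact = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # Normal Equation, for comparison
```

With a well-chosen learning rate the iterates converge geometrically toward w_exact, here (4, 3) up to noise.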

Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration. When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does. To help SGD converge despite the fluctuations due to its stochastic nature, we gradually reduce the learning rate. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.

When using Stochastic Gradient Descent, the training instances must be independent and identically distributed (IID), to ensure that the parameters get pulled towards the global optimum, on average. A simple way to ensure this is to shuffle the instances during training (e.g., pick each instance randomly, or shuffle the training set at the beginning of each epoch). If you do not do this, for example if the instances are sorted by label, then SGD will start by optimizing for one label, then the next, and so on, and it will not settle close to the global minimum.

Another very common approach is Mini-batch Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
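
The three ideas above — mini-batches, per-epoch shuffling, and a decaying learning schedule — can be sketched together as follows (the schedule constants t0, t1 and the batch size are illustrative assumptions; setting batch_size = 1 would give plain Stochastic GD):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic linear data t = 4 + 3x + noise, as before.
N = 200
x = rng.uniform(0, 2, N)
Phi = np.column_stack([np.ones(N), x])
t = 4 + 3 * x + rng.normal(0, 0.5, N)

def learning_schedule(step, t0=50.0, t1=500.0):
    """Gradually decaying learning rate alpha = t0 / (step + t1)."""
    return t0 / (step + t1)

w = np.zeros(2)
batch_size = 10                  # batch_size = 1 would be plain SGD
step = 0
for epoch in range(50):
    perm = rng.permutation(N)    # shuffle every epoch to keep draws IID-like
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        Phi_b, t_b = Phi[idx], t[idx]
        grad = -Phi_b.T @ (t_b - Phi_b @ w) / batch_size
        w -= learning_schedule(step) * grad
        step += 1

w_exact = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # Normal Equation, for comparison
```

Because the learning rate never reaches zero, the final w hovers in a small neighborhood of the exact solution rather than settling on it exactly.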

Early Stopping

As the epochs go by, the algorithm learns and its prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set. However, after a while the validation error stops decreasing and actually starts to go back up. This indicates that the model has started to overfit the training data. With early stopping you just stop training as soon as the validation error reaches the minimum. It is such a simple and efficient regularization technique.
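
Early stopping can be sketched as tracking the validation RMSE each epoch and keeping the best parameter snapshot seen so far (the degree-8 polynomial, learning rate, and train/validation split below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy quadratic data, split into training and validation sets.
N = 60
x = rng.uniform(-1, 1, N)
t = x ** 2 + rng.normal(0, 0.1, N)
Phi = np.vander(x, 9, increasing=True)        # degree-8 polynomial: can overfit
Phi_tr, Phi_va = Phi[:40], Phi[40:]
t_tr, t_va = t[:40], t[40:]

w = np.zeros(Phi.shape[1])
alpha = 0.05
best_w, best_err, best_epoch = w.copy(), np.inf, 0
for epoch in range(5000):
    grad = -Phi_tr.T @ (t_tr - Phi_tr @ w) / len(t_tr)
    w -= alpha * grad
    val_err = np.sqrt(np.mean((t_va - Phi_va @ w) ** 2))   # validation RMSE
    if val_err < best_err:
        # Keep the best snapshot so far; stop (or roll back) once the
        # validation error starts climbing again.
        best_err, best_w, best_epoch = val_err, w.copy(), epoch
```

Keeping the best snapshot (rather than halting at the literal first uptick) is a common, slightly more robust variant, since the validation error can fluctuate from epoch to epoch.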


Batch and Online Learning

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data. In batch learning, the system is incapable of learning incrementally; it must be trained using all the available data which generally takes a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore (offline learning ). If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one. If your system needs to adapt to rapidly changing data (e.g., to predict stock prices), then you need a more reactive solution. Also, training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives. Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if computing resources are limited. A big challenge with online learning is that if bad data is fed to the system, the system’s performance will gradually decline. If we are talking about a live system, your clients will notice.

Regularized Least Squares

The idea of adding a regularization term to an error function in order to control over-fitting to improve generalization is common practice. So the total error to be minimized is ED(w)+λEW(w)E_D(\bm w) + \lambda E_W(\bm w) where λλ is the regularization coefficient that controls the relative importance of the data-dependent error ED(w)E_D(\bm w) and the regularization term EW(w)E_W (\bm w). One of the simplest forms of regularizer is given by the sum-of-squares of the weight vector elements EW(w)=12wTwE_W(\bm w) = \frac{1}{2}\bm w^T \bm w. So the total error function becomes:

12n=1N{tnwTϕ(xn)}2+λ2wTw\frac{1}{2} \sum_{n=1}^N \{ t_n - \bm w^T \bm \phi(\bm x_n)\}^2 + \frac{\lambda}{2}\bm w^T \bm w

This particular choice of regularizer encourages weight values to decay towards zero, unless supported by the data. It has the advantage that the error function remains a quadratic function of w\bm w, and so its exact minimizer can be found in closed form. Specifically, setting the gradient with respect to w\bm w to zero, and solving for w\bm w as before, we obtain:

w=(λI+ΦTΦ)1ΦTt\bm w = (λ\bm I + \bm Φ^T\bm Φ)^{−1} \bm Φ^T \bm t

The solution adds a positive constant to the diagonal of ΦTΦ\bm Φ^T\bm Φ before inversion. This makes the problem nonsingular, even if ΦTΦ\bm Φ^T\bm Φ is not of full rank, and was the main motivation for ridge regression when it was first introduced in statistics (Hoerl and Kennard, 1970). A more general regularizer is sometimes used, for which the regularized error takes the form

12n=1N{tnwTϕ(xn)}2+λ2j=1Mwjq\frac{1}{2} \sum_{n=1}^N \{ t_n - \bm w^T \bm \phi(\bm x_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^M | w_j|^q

where q=2q = 2 corresponds to the quadratic regularizer. The case of q=1q = 1 is known as the lasso in the statistics literature. This is L1 regularization, which encourages some of the coefficients wjw_j to be exactly zero if λλ is sufficiently large, leading to a sparse model in which the corresponding basis functions play no role. To see this, we first note that minimizing the above objective is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint

j=1Mwjqη\sum_{j=1}^M | w_j|^q \le \eta

for an appropriate value of the parameter ηη, where the two approaches can be related using Lagrange multipliers. L1 regularization is useful in situations where you have lots of features, but only a small fraction of them are likely to be relevant (e.g. genetics). The above cost function can be written as a quadratic program, a more difficult optimization problem than for L2 regularization. Plain gradient descent runs into trouble because the L1 term is not differentiable at wj=0w_j = 0, which is exactly where a sparse solution needs to sit. Fast algorithms are implemented in frameworks like scikit-learn.


As λλ is increased, so an increasing number of parameters are driven to zero. Regularization allows complex models to be trained on datasets of limited size without severe over-fitting, essentially by limiting the effective model complexity. It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.
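
The ridge closed-form solution is easy to verify numerically. In this sketch (random synthetic data and illustrative λ values), Φ^TΦ is deliberately rank-deficient because there are more basis functions than data points, yet λI + Φ^TΦ remains invertible, and increasing λ shrinks the weights:

```python
import numpy as np

rng = np.random.default_rng(6)

# Ill-posed setting: more basis functions (M = 30) than data points (N = 15),
# so Phi^T Phi has rank at most 15 and is singular on its own.
N, M = 15, 30
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# Ridge solution: w = (lambda I + Phi^T Phi)^{-1} Phi^T t.
lam = 0.1
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# A larger lambda shrinks the weight vector toward zero.
w_big_lam = np.linalg.solve(100.0 * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

The positive constant λ added to the diagonal is what makes the system solvable here, exactly the nonsingularity argument given above.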

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio rr. When r=0r = 0, Elastic Net is equivalent to Ridge Regression, and when r=1r = 1, it is equivalent to Lasso Regression

L=MSE(w)+rαi=1nwi+1r2αi=1nwi2L = \text{MSE}(\bm w) + r\alpha\sum_{i=1}^n |w_i| + \frac{1-r}{2}\alpha\sum_{i=1}^n w_i^2

It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net.

Bias-Variance Decomposition

Let's consider a frequentist viewpoint of the model complexity issue, known as the bias-variance trade-off. When we discussed decision theory for regression problems, we considered various loss functions each of which leads to a corresponding optimal prediction once we are given the conditional distribution p(tx)p(t \mid \bm x). A popular choice is the squared loss function, for which the optimal prediction is given by the conditional expectation, which we denote by h(x)h(x) and which is given by

h(x)=E[tx]=tp(tx)dth(\bm x) = \mathbb E[t\mid \bm x] = \int t p(t\mid x)dt

We showed that the expected squared loss can be written in the form

E[L]=(y(x)h(x))2p(x)dx+(h(x)t)2p(x,t)dxdt\begin{align*} \mathbb E[L] = \int \Big(y(\bm x) − h(\bm x) \Big)^2 p(\bm x)d\bm x + \int \Big(h(\bm x) - t \Big)^2 p(\bm x, t)d\bm x dt \end{align*}

Recall that the second term, which is independent of y(x)y(\bm x), arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss. The first term depends on our choice for the function y(x)y(\bm x), and we will seek a solution for y(x)y(\bm x) which makes this term a minimum. Because it is nonnegative, the smallest that we can hope to make this term is zero. However, in practice we have a dataset D\mathcal D containing only a finite number NN of data points, rather than an unlimited amount of data, and consequently we try to estimate the regression function h(x)h(\bm x). If we model h(x)h(\bm x) using a parametric function y(x,w)y(\bm x, \bm w) governed by a parameter vector w\bm w, then from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over w\bm w.

A frequentist treatment, however, involves making a point estimate of w\bm w based on the dataset D\mathcal D, and tries instead to interpret the uncertainty of this estimate through the following thought experiment: Suppose we had a large number of datasets each of size N and each drawn independently from the distribution p(t,x)p(t,\bm x). For any given dataset D\mathcal D, we can run our learning algorithm and obtain a prediction function y(x;D)y(\bm x; \mathcal D). Different datasets from the ensemble will give different functions and consequently different values of the squared loss. The performance of a particular learning algorithm is then assessed by taking the average over this ensemble of datasets. Now the expectation of squared error with respect to D\mathcal D is

ED[(y(x;D)h(x))2]==ED[(y(x;D)ED[y(x;D)]+ED[y(x;D)]h(x))2]=ED[(y(x;D)ED[y(x;D)])2]++ED[2(y(x;D)ED[y(x;D)])(ED[y(x;D)]h(x))]+ED[(ED[y(x;D)]h(x))2]==(ED[y(x;D)]h(x))2+ED[(y(x;D)ED[y(x;D)])2]\begin{align*} \mathbb E_{ \mathcal D} & \Big[ \Big( y(\bm x; \mathcal D) − h(\bm x) \Big) ^2 \Big] = \\ & = \mathbb E_{ \mathcal D} \Big[ \Big ( y(\bm x; \mathcal D) − \mathbb E_{ \mathcal D}[ y(\bm x; \mathcal D)] + \mathbb E_{\mathcal D} [ y(\bm x; \mathcal D)] - h(\bm x) \Big )^2\Big] \\ &= \mathbb E_{\mathcal D} \Big[ \Big( y(\bm x; \mathcal D) − \mathbb E_{ \mathcal D}[ y(\bm x; \mathcal D)] \Big )^2 \Big] + \\ & + \cancel {\mathbb E_{ \mathcal D} \Big[ 2 \big( y(\bm x; \mathcal D) - \mathbb E_{ \mathcal D}[ y(\bm x; \mathcal D)] \big) \big (\mathbb E_{ \mathcal D}[ y(\bm x; \mathcal D)] - h(\bm x) \big) \Big ] }\\ &+ \mathbb E_{ \mathcal D}\Big [ \big ( \mathbb E_{\mathcal D} [ y(\bm x; \mathcal D)]- h(\bm x) \big ) ^2 \Big ] = \\ & = \big ( \mathbb E_{\mathcal D} [ y(\bm x; \mathcal D)]- h(\bm x) \big ) ^2 + \mathbb E_{\mathcal D} \Big[ \Big( y(\bm x; \mathcal D) − \mathbb E_{ \mathcal D}[ y(\bm x; \mathcal D)] \Big )^2 \Big] \end{align*}

We see that the expected squared difference between y(x;D)y(\bm x; \mathcal D) and the regression function h(x)h(\bm x) can be expressed as the sum of two terms. The first term, called the squared bias, represents the extent to which the average prediction over all datasets differs from the desired regression function. The second term, called the variance, measures the extent to which the solutions for individual datasets vary around their average, and hence this measures the extent to which the function y(x;D)y(\bm x; \mathcal D) is sensitive to the particular choice of dataset.

So far, we have considered a single input value x\bm x. If we substitute this expansion back into (2), we obtain the following decomposition of the expected squared loss:

                                      expected loss=(bias)2+variance+noise\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}


where:

\begin{align*} \text{(bias)}^2 &= \int \big ( \mathbb E_{\mathcal D} [ y(\bm x; \mathcal D)]- h(\bm x) \big ) ^2 p(\bm x) d\bm x \\ \text{variance of $y$} &= \int \mathbb E_{\mathcal D} \Big[ \Big( y(\bm x; \mathcal D) - \mathbb E_{ \mathcal D}[ y(\bm x; \mathcal D)] \Big )^2 \Big] p(\bm x) d\bm x \\ \text{noise (Bayes error)} &= \int \Big(h(\bm x) - t \Big)^2 p(\bm x, t)d\bm x dt \end{align*}

and the bias and variance terms now refer to integrated quantities. To make this concrete, we create 100 datasets (l=1,\dots,100), each containing N = 25 data points drawn independently from the sinusoidal curve h(x) = \sin(2\pi x). For each dataset \mathcal D^l, we fit a model with 24 Gaussian basis functions by minimizing the regularized error function (with coefficient \lambda) to give a prediction function y^l(x). A large value of the regularization coefficient \lambda gives low variance but high bias.

drawing

Conversely on the bottom row, for which λλ is small, there is large variance (shown by the high variability between the red curves in the left plot) but low bias (shown by the good fit between the average model fit and the original sinusoidal function). Note that the result of averaging many solutions for the complex model with M = 25 is a very good fit to the regression function, which suggests that averaging may be a beneficial procedure. Indeed, a weighted averaging of multiple solutions lies at the heart of a Bayesian approach, although the averaging is with respect to the posterior distribution of parameters, not with respect to multiple datasets. The average prediction is estimated from
yˉ(x)=1100l=1100yl(x)\bar y(x) = \frac{1}{100} \sum_{l=1}^{100} y^l(x)

and the integrated squared bias and integrated variance are then given by

\begin{align*} \text{(bias)}^2 &= \frac{1}{25} \sum_{n=1}^{25} \big (\bar y(x_n) - h(x_n) \big)^2\\ \text{variance} &= \frac{1}{25} \sum_{n=1}^{25} \frac{1}{100} \sum_{l=1}^{100} \big ( y^l(x_n) - \bar y(x_n) \big)^2 \end{align*}

where the integral over xx weighted by the distribution p(x)p(x) is approximated by a finite sum over data points drawn from that distribution. We see that small values of λλ allow the model to become finely tuned to the noise on each individual dataset leading to large variance. Conversely, a large value of λλ pulls the weight parameters towards zero leading to large bias.
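The simulation described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the original experiment's code: the basis-function width, the regularization coefficient, and the noise level are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, M = 100, 25, 24          # number of datasets, points per dataset, basis functions
lam, noise_std = 0.1, 0.3      # regularization coefficient and noise level (assumed)

def gaussian_basis(x, centers, s=0.1):
    # design matrix: a bias column plus M Gaussian bumps
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

centers = np.linspace(0, 1, M)
x_grid = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_grid)                    # true regression function
Phi_grid = gaussian_basis(x_grid, centers)

preds = []
for _ in range(L):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_std, N)
    Phi = gaussian_basis(x, centers)
    # regularized least squares: w = (lam I + Phi^T Phi)^{-1} Phi^T t
    w = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
    preds.append(Phi_grid @ w)

preds = np.array(preds)                           # one prediction curve per dataset
y_bar = preds.mean(axis=0)                        # average prediction over the ensemble
bias2 = np.mean((y_bar - h) ** 2)                 # integrated (bias)^2
variance = np.mean((preds - y_bar) ** 2)          # integrated variance
```

Varying `lam` in this sketch reproduces the qualitative behaviour described above: larger values shrink the variance while inflating the squared bias.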

The Bias/Variance Tradeoff

The above equation leads to an important theoretical result in statistics and machine learning: a model’s expected error can be expressed as the sum of three very different errors:

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term. As we shall see, there is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance. The model with the optimal predictive capability is the one that leads to the best balance between bias and variance. If we have an overly simple model (e.g. KNN with large k), it might have

If you have an overly complex model (e.g. KNN with k = 1), it might have

Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a tradeoff.
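This tradeoff is easy to check empirically with scikit-learn's `KNeighborsRegressor`; the synthetic data below (a sinusoid plus Gaussian noise) is an assumption of this sketch. The variance of the predictions across independently drawn training sets shrinks as k grows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x_grid = np.linspace(0, 1, 50).reshape(-1, 1)

def fit_predict(k):
    # train one KNN model per synthetic dataset and predict on a fixed grid
    preds = []
    for _ in range(50):
        x = rng.uniform(0, 1, 30).reshape(-1, 1)
        t = np.sin(2 * np.pi * x.ravel()) + rng.normal(0, 0.3, 30)
        preds.append(KNeighborsRegressor(n_neighbors=k).fit(x, t).predict(x_grid))
    return np.array(preds)

def variance_of(preds):
    # average squared deviation of each fit from the ensemble mean
    return np.mean((preds - preds.mean(axis=0)) ** 2)

var_k1 = variance_of(fit_predict(1))    # flexible model: high variance
var_k25 = variance_of(fit_predict(25))  # rigid model: low variance
```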

Bayesian Linear Regression

We have seen that the effective model complexity, governed by the number of basis functions, needs to be controlled according to the size of the dataset. Adding a regularization term to the log likelihood function means the effective model complexity can then be controlled by the value of the regularization coefficient, although the choice of the number and form of the basis functions is of course still important in determining the overall behaviour of the model. This leaves the issue of deciding the appropriate model complexity for the particular problem, which cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting. Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of valuable data. We therefore turn to a Bayesian treatment of linear regression.

Parameter Distribution

We begin our discussion of the Bayesian treatment of linear regression by introducing a prior probability distribution over the model parameters w\bm w. For the moment, we shall treat the noise precision parameter ββ as a known constant. First note that the likelihood function p(tw)p(\bm t \mid \bm w) (or p(tx,w)p(\bm t \mid \bm x, \bm w) - recall that we decided not to mention x\bm x because we are not modeling its distribution) is the exponential of a quadratic function of w\bm w. The corresponding conjugate prior is therefore given by a Gaussian distribution of the form

p(w)=N(wm0,S0)p(\bm w) = \mathcal N (\bm w \mid \bm m_0, \bm S_0)

having mean m0\bm m_0 and covariance S0\bm S_0. Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. For simplicity, we consider a zero-mean isotropic Gaussian governed by a single precision parameter αα so that

p(wα)=N(w0,α1I)=(α2π)(M+1)/2exp{α2wTw}p(\bm w\mid \alpha) = \mathcal N (\bm w \mid \bm 0, \alpha^{-1}\bm I) = \Big(\frac{α}{2π}\Big)^{(M+1)/2}\exp\{−\frac{α}{2}\bm w^T \bm w\}

Variables such as αα, which control the distribution of model parameters, are called hyperparameters. Using Bayes’ theorem, the posterior distribution for w\bm w is proportional to the product of the prior distribution and the likelihood function

p(wx,t,α,β)=p(tx,w,β)p(wα)p(tx,w)p(w)dwp(\bm w\mid \bm x, \bm t, α, β) = \frac{p(\bm t\mid \bm x, \bm w, β)p(\bm w \mid α)}{\int p(\bm t\mid \bm x,\bm w)p(\bm w)d\bm w}

Or,

p(wx,t,α,β)p(tx,w,β)p(wα).p(\bm w\mid \bm x, \bm t, α, β) ∝ p(\bm t\mid \bm x, \bm w, β)p(\bm w\mid α).

Due to the choice of a conjugate Gaussian prior distribution, the posterior distribution over w\bm w will also be Gaussian:

p(wt)=N(wmN,SN)p(\bm w \mid \bm t) = \mathcal N (\bm w \mid \bm m_N, \bm S_N)

where

\begin{align*} \bm m_N & = \beta \bm S_N \bm \Phi^T\bm t \\ \bm S^{-1}_N & = \alpha \bm I + \beta \bm \Phi^T \bm \Phi \end{align*}

We can now determine \bm w by finding the most probable value of \bm w given the data, in other words by maximizing the posterior distribution. This technique is called maximum a posteriori estimation, or simply MAP. The negative log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of \bm w, we find that the maximum of the posterior is given by the minimum of

lnp(wt)=β2n=1N{tnwTϕ(xn)}2+α2wTw+const.\begin{align*} -\ln p( \bm w \mid \bm t) &= \frac{\beta}{2}\sum_{n=1}^N \{ t_n - \bm w^T \bm \phi(\bm x_n)\}^2 + \frac{\alpha}{2}\bm w^T \bm w + \text{const.} \end{align*}

Maximization of this posterior distribution with respect to \bm w is therefore equivalent to minimization of the sum-of-squares error function with the addition of a quadratic regularization term, with regularization coefficient \lambda = \alpha/\beta.
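These posterior updates can be sketched directly in NumPy, using the straight-line example discussed next (true parameters −0.3 and 0.5, α = 2.0, β = 25); the synthetic data generation is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0   # prior precision and (known) noise precision

# synthetic straight-line data: t = -0.3 + 0.5 x + Gaussian noise of std 0.2
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 20)
Phi = np.column_stack([np.ones_like(x), x])   # design matrix for y = w0 + w1 x

# posterior N(w | m_N, S_N): S_N^{-1} = alpha I + beta Phi^T Phi, m_N = beta S_N Phi^T t
S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t
```

With 20 observations the posterior mean `m_N` should already sit close to the generating parameters (−0.3, 0.5), while `S_N` quantifies the remaining uncertainty.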

We can illustrate Bayesian learning in a linear basis function model, using a simple example involving straight-line fitting. Consider a single input variable xx, a single target variable tt and a linear model of the form y(x,w)=w0+w1xy(x,\bm w) = w_0 + w_1x. Because this has just two adaptive parameters, we can plot the prior and posterior distributions directly in parameter space. We generate synthetic data from the function f(x,a)=a0+a1xf(x,\bm a) = a_0 +a_1x with parameter values a0=0.3a_0 =−0.3 and a1=0.5a_1 = 0.5 by first choosing values of xnU(1,1)x_n \sim U(−1, 1) from the uniform distribution, then evaluating f(xn,a)f(x_n, \bm a), and finally adding Gaussian noise with standard deviation of 0.2 to obtain the target values tnt_n.

Our goal is to recover the values of a0a_0 and a1a_1 from such data, and we will explore the dependence on the size of the dataset. We assume here that the noise variance is known and hence we set the precision parameter to its true value β=(1/0.2)2=25β = (1/0.2)^2 = 25. Similarly, we fix the parameter αα to 2.0. The following Figure shows the results of Bayesian learning in this model as the size of the dataset increases and demonstrates the sequential nature of Bayesian learning in which the current posterior distribution forms the prior when a new data point is observed.

drawing

The first row of this figure corresponds to the situation before any data points are observed and shows a plot of the prior distribution in w\bm w space together with six samples of the function y(x,w)y(\bm x,\bm w) in which the values of w\bm w are drawn from the prior. In the second row, we see the situation after observing a single data point. The location (x,t)(\bm x, t) of the data point is shown by a blue circle in the right-hand column. In the left-hand column is a plot of the likelihood function p(tx,w)p(t\mid \bm x, \bm w) for this data point as a function of w\bm w. Note that the likelihood function provides a soft constraint that the line must pass close to the data point, where close is determined by the noise precision ββ. For comparison, the true parameter values a0=0.3a_0 =−0.3 and a1=0.5a_1 = 0.5 used to generate the dataset are shown by a white cross in the plots in the left column. When we multiply this likelihood function by the prior from the top row, and normalize, we obtain the posterior distribution shown in the middle plot on the second row. Samples of the regression function y(x,w)y(\bm x,\bm w) obtained by drawing samples of w\bm w from this posterior distribution are shown in the right-hand plot. Note that these sample lines all pass close to the data point. The third row of this figure shows the effect of observing a second data point, again shown by a blue circle in the plot in the right-hand column. The corresponding likelihood function for this second data point alone is shown in the left plot. When we multiply this likelihood function by the posterior distribution from the second row, we obtain the posterior distribution shown in the middle plot of the third row. Note that this is exactly the same posterior distribution as would be obtained by combining the original prior with the likelihood function for the two data points. 
This posterior has now been influenced by two data points, and because two points are sufficient to define a line this already gives a relatively compact posterior distribution. Samples from this posterior distribution give rise to the functions shown in red in the third column, and we see that these functions pass close to both of the data points. The fourth row shows the effect of observing a total of 20 data points. The left-hand plot shows the likelihood function for the 20th data point alone, and the middle plot shows the resulting posterior distribution that has now absorbed information from all 20 observations. Note how the posterior is much sharper than in the third row. In the limit of an infinite number of data points, the posterior distribution would become a delta function centred on the true parameter values, shown by the white cross.

Predictive Distribution

In practice, we are not usually interested in the value of w\bm w itself but rather in making predictions of tt for new values of xx. This requires that we evaluate the predictive distribution defined by

p(tt,α,β)=p(tw,β)p(wt,α,β)dwp(t|\bm t, α, β) = \int p(t|\bm w, β)p(\bm w|\bm t, α, β) d\bm w

in which \bm t is the vector of target values from the training set, and we have omitted the corresponding input vectors. Because this equation involves the convolution of two Gaussian distributions, the predictive distribution takes the form

p(tx,t,α,β)=N(tmNTϕ(x),σN2(x))p(t \mid \bm x, \bm t, α, β) = \mathcal N (t \mid \bm m^T_N \bm \phi(\bm x), σ^2_N (\bm x))

where the variance σN2(x)σ^2_N (\bm x) of the predictive distribution is given by

σN2(x)=1β+ϕ(x)TSNϕ(x)\sigma^2_N(\bm x) = \frac{1}{\beta} + \bm \phi(\bm x)^TS_N\bm \phi(\bm x)

The first term in the above equation represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters w\bm w. Because the noise process and the distribution of w\bm w are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown that σN+12(x)σN2(x)\sigma^2_{N+1}(\bm x)\leq \sigma^2_N(\bm x). In the limit NN → ∞, the second term goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter ββ. As an illustration of the predictive distribution for Bayesian linear regression models, let us return to the synthetic sinusoidal dataset.
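The predictive mean and variance follow directly from m_N and S_N. A sketch for the straight-line model (the data generation and parameter values are assumptions carried over from the earlier example):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0

# same synthetic straight-line data as before
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 20)
Phi = np.column_stack([np.ones_like(x), x])

S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predictive(x_new):
    # predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)
    phi = np.array([1.0, x_new])
    return m_N @ phi, 1.0 / beta + phi @ S_N @ phi

mean, var = predictive(0.5)
```

Note that `var` can never fall below `1/beta`: the parameter-uncertainty term `phi @ S_N @ phi` is nonnegative and shrinks as more data is observed, exactly as the text describes.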

drawing

We fit a model comprising a linear combination of Gaussian basis functions to datasets of various sizes and then look at the corresponding posterior distributions. Here the green curves correspond to the function sin(2πx)\sin(2πx) from which the data points were generated (with the addition of Gaussian noise). Datasets of size N = 1, N = 2, N = 4, and N = 25 are shown in the four plots by the blue circles. For each plot, the red curve shows the mean of the corresponding Gaussian predictive distribution, and the red shaded region spans one standard deviation either side of the mean. Note that the predictive uncertainty depends on xx and is smallest in the neighbourhood of the data points. Also note that the level of uncertainty decreases as more data points are observed.

Model Selection: Testing and Validating

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. Split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error. The problem arises if you measure the generalization error many times on the test set and adapt the model and hyperparameters to produce the best model for that particular set; the model is then unlikely to perform as well on genuinely new data.

Furthermore, as well as finding the appropriate values for complexity parameters within a given model, we may wish to consider a range of different types of model in order to find the best one for our particular application. If data is plentiful, then one approach is simply to use some of the available data to train a range of models, or a given model with a range of values for its complexity parameters, and then to compare them on independent data, sometimes called a validation set (or the development set, or dev set). More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and you select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalization error. If the model design is iterated many times using a limited-size dataset, then some over-fitting to the validation data can occur, and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated. Note that if the validation set is too small, then model evaluations will be imprecise. One way to solve this problem is to perform repeated cross-validation, using many small validation sets.

In many applications, however, the supply of data for training and testing will be limited, and in order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution to this dilemma is to use cross-validation. This allows a proportion (S−1)/S of the available data to be used for training while making use of all of the data to assess performance. When data is particularly scarce, it may be appropriate to consider the case S = N, where N is the total number of data points, which gives the leave-one-out technique. In general, cross-validation works by taking the available data and partitioning it into S groups (in the simplest case these are of equal size). Then S− 1 of the groups are used to train a set of models that are then evaluated on the remaining group. This procedure is then repeated for all S possible choices for the held-out group, indicated here by the red blocks, and the performance scores from the S runs are then averaged.

drawing

One major drawback of cross-validation is that the number of training runs that must be performed is increased by a factor of S, and this can prove problematic for models in which the training is computationally expensive. A further problem with techniques such as cross-validation that use separate data to assess performance is that we might have multiple complexity parameters for a single model (for instance, there might be several regularization parameters). Exploring combinations of settings for such parameters could, in the worst case, require a number of training runs that is exponential in the number of parameters.
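In scikit-learn, S-fold cross-validation is a one-liner via `cross_val_score`; the Ridge model and the synthetic sinusoidal data below are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.3, 40)

# S-fold cross-validation: each of the S groups is held out exactly once
S = 5
scores = cross_val_score(Ridge(alpha=0.1), X, y,
                         cv=KFold(n_splits=S, shuffle=True, random_state=0),
                         scoring="neg_mean_squared_error")
mean_score = scores.mean()   # average performance over the S runs
```

Setting `cv=len(X)` (or using `LeaveOneOut`) recovers the leave-one-out case S = N mentioned above, at the cost of N training runs.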

Linear Models for Classification

The goal in classification is to take an input vector \bm x and to assign it to one of K discrete classes C_k where k = 1, \dots, K. In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class. The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. Here we consider linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector \bm x and hence are defined by (D-1)-dimensional hyperplanes within the D-dimensional input space. Datasets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable. For regression problems, the target variable t was simply a vector of real numbers. In the case of classification, there are various ways of using target values to represent class labels. For two-class problems, the simplest is the binary representation, in which there is a single target variable t \in \{0, 1\} such that t = 1 represents class C_1 and t = 0 represents class C_2. We can then interpret the value of t as the probability that the class is C_1, with the probability taking only the extreme values of 0 and 1. For K > 2 classes, it is convenient to use a one-hot vector in which \bm t is a vector of length K such that if the class is C_j, then all elements t_k of \bm t are zero except element t_j, which takes the value 1. For instance, if we have K = 5 classes, then a pattern from class 2 would be given the target vector \bm t = (0, 1, 0, 0, 0)^T.

In general, there are two approaches to classification:

Discriminant Functions

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that y(\bm x) = \bm w^T\bm x + w_0, where \bm w is called a weight vector and w_0 is a bias (not in the statistical sense). The negative of the bias is sometimes called a threshold. An input vector \bm x is assigned to class C_1 if y(\bm x) \geq 0 and to class C_2 otherwise. The corresponding decision boundary is therefore defined by the relation y(\bm x) = 0. It is often more convenient to express this as y(\bm x) = \tilde {\bm w}^T\tilde {\bm x}, where \tilde {\bm w} = (w_0, \bm w) and \tilde {\bm x} = (1, \bm x).

Multiclass Classification

Some algorithms (such as Random Forest classifiers or Naive Bayes classifiers) are capable of handling multiple classes directly. Others (such as Support Vector Machine classifiers or Linear classifiers in general) are strictly binary classifiers. However, there are various strategies that you can use to perform multiclass classification using multiple binary classifiers. After training KK binary classifiers for KK classes, then when you want to classify a test example, you get the decision score from each classifier for that example and you select the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest). Another strategy is to train a binary classifier for every pair of classes, K(K1)/2K(K-1)/2 classifiers. This is called the one-versus-one (OvO) strategy. Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size of the training set, so for these algorithms OvO is preferred since it is faster to train many classifiers on small training sets than training few classifiers on large training sets. For most binary classification algorithms, however, OvA is preferred.

Now consider the extension of linear discriminants to K > 2 classes. We might be tempted to build a K-class discriminant by combining a number of two-class discriminant functions. However, this leads to some serious difficulties. Consider K(K-1)/2 binary discriminant functions, one for every possible pair of classes (OvO). Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram below.

drawing

Alternatively, consider the use of K-1 classifiers, each of which solves a two-class problem of separating points in a particular class C_k from points not in that class (one-versus-the-rest); this also produces ambiguous regions, as shown in the left-hand diagram. We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form

yk(x)=wkTx+wk0y_k(\bm x) = \bm w^T_k \bm x + w_{k0}

and then assigning a point x\bm x to class CkC_k if yk(x)>yj(x)y_k(\bm x) > y_j(\bm x) for all jkj \neq k. The decision boundary between class CkC_k and class CjC_j is therefore given by yk(x)=yj(x)y_k(\bm x) = y_j(\bm x) and hence corresponds to a (D−1)-dimensional hyperplane defined by

(wkwj)Tx+(wk0wj0)=0,(\bm w_k− \bm w_j)^T \bm x + (w_{k0}− w_{j0}) = 0,

which has the same form as the decision boundary for the two-class case. For two lines as the discriminants, their angle bisector becomes the decision boundary. The decision regions of such a discriminant are always simply connected and convex. To see this, consider two points \bm x_A and \bm x_B, both of which lie inside decision region \mathcal R_k. Any point \hat {\bm x} that lies on the line connecting \bm x_A and \bm x_B can be expressed in the form \hat {\bm x} = \lambda \bm x_A + (1-\lambda)\bm x_B with 0 \leq \lambda \leq 1. By linearity of the discriminants, y_k(\hat {\bm x}) = \lambda y_k(\bm x_A) + (1-\lambda) y_k(\bm x_B), and since y_k > y_j at both \bm x_A and \bm x_B for all j \neq k, it follows that y_k(\hat {\bm x}) > y_j(\hat {\bm x}) for all j \neq k, so \hat {\bm x} also lies in \mathcal R_k.
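A tiny numeric illustration of the argmax rule and of this convexity property; the weight values below are arbitrary assumptions chosen for the sketch.

```python
import numpy as np

# illustrative weights for K = 3 linear discriminants y_k(x) = w_k^T x + w_k0
W = np.array([[1.0, 0.0],       # w_1
              [0.0, 1.0],       # w_2
              [-1.0, -1.0]])    # w_3
w0 = np.array([0.0, 0.0, 0.5])

def classify(x):
    # assign x to the class with the largest discriminant value
    return int(np.argmax(W @ x + w0))

x_a, x_b = np.array([2.0, 0.1]), np.array([1.5, 0.2])
k = classify(x_a)
# convexity check: every convex combination of two points in region R_k stays in R_k
same_region = all(classify(lam * x_a + (1 - lam) * x_b) == k
                  for lam in np.linspace(0, 1, 11))
```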

drawing

Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO). If you want to force Scikit-Learn to use one-versus-one or one-versus-all, you can use the OneVsOneClassifier or OneVsRestClassifier classes.
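For example (the iris dataset is used here purely as a convenient built-in three-class dataset):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # K = 3 classes

# wrap a strictly binary classifier in each multiclass strategy
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # trains K classifiers
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # trains K(K-1)/2 classifiers

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
```

For K = 3 both strategies happen to train three underlying classifiers (3 = 3·2/2); the gap widens quickly for larger K.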

Multilabel classification is a classification system that outputs multiple binary tags. One approach to evaluating multilabel classifiers is to measure the F_1 score for each individual label (or any other binary classifier metric discussed earlier) and then compute the average score. The following code computes the average F_1 score across all labels.

from sklearn.metrics import f1_score

f1_score(y_multilabel, y_train_knn_pred, average="macro")

This assumes that all labels are equally important, which may not be the case. One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label). To do this, simply set average="weighted" in the preceding code.

Performance Measures for Classification

Evaluating a classifier is often significantly trickier than evaluating a regressor.

Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff. Classifiers make decisions based on a score computed by a decision function: if that score is greater than a threshold, the instance is assigned to the positive class; otherwise it is assigned to the negative class. Lowering the threshold increases the true positive rate (recall), while raising the threshold increases precision (by reducing false positives).
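The threshold effect is easy to verify on a toy example; the decision scores and labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# illustrative decision scores (sorted) and true labels
scores = np.array([-2.0, -1.0, -0.5, 0.2, 0.4, 0.8, 1.5, 2.0])
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def prec_rec(threshold):
    # predict positive whenever the decision score clears the threshold
    y_pred = (scores >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

p_low, r_low = prec_rec(-1.5)   # low threshold: perfect recall, diluted precision
p_high, r_high = prec_rec(1.0)  # high threshold: perfect precision, reduced recall
```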

drawing

  1. PR Curve: A way to select a good precision/recall tradeoff is to plot precision directly against recall.

  2. The ROC Curve: Very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (recall) against the false positive rate.

drawing

The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner). One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. You can think of ROC as more of a recall-oriented metric, whereas the PR curve is more sensitive to precision.
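Computing ROC AUC with scikit-learn on toy scores (values made up for illustration); the AUC equals the fraction of (positive, negative) pairs that the scorer ranks correctly, and a perfect ranking gives 1.0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([-2.0, -1.0, -0.5, 0.2, 0.4, 0.8, 1.5, 2.0])

auc = roc_auc_score(y_true, scores)          # 15 of 16 pairs ranked correctly here
auc_perfect = roc_auc_score(y_true, y_true)  # a perfect scorer attains AUC = 1.0
```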

Least Squares for Classification

Consider a general classification problem with KK classes, with a one-hot vector for the target vector t\bm t . One justification for using least squares in such a context is that it approximates the conditional expectation E[tx]\mathbb E[\bm t \mid \bm x] of the target values given the input vector. Each class CkC_k is described by its own linear model so that

yk(x)=wkTx+wk0y_k(\bm x) = \bm w^T_k \bm x +w_{k0}

where k=1,...,Kk = 1, . . . , K. We can conveniently group these together using vector notation so that

y(x)=W~Tx~y(\bm x) = \tilde {\bm W}^T\bm {\tilde x}

where \tilde {\bm W} is a matrix whose k-th column comprises the (D+1)-dimensional vector \bm {\tilde w}_k and \bm {\tilde x} is the corresponding augmented input vector (1,\bm x)^T with a dummy input x_0 = 1. We now determine the parameter matrix \tilde {\bm W} by minimizing a sum-of-squares error function, as we did for regression. Consider a training dataset \{\bm x_n, \bm t_n \}, n = 1, \dots, N, and define a matrix \bm T whose n-th row is the vector \bm t_n^T, together with a matrix \tilde {\bm X} whose n-th row is \bm {\tilde x}_n^T. The sum-of-squares error function can then be written as

ED(W~)=12Tr{(X~W~T)T(X~W~T)}.E_D(\tilde {\bm W}) = \frac{1}{2} Tr \{ (\tilde {\bm X}\tilde {\bm W}− \bm T)^T(\tilde {\bm X}\tilde {\bm W}− \bm T) \}.

Setting the derivative with respect to W~\tilde {\bm W} to zero, and rearranging, we then obtain the solution for W~\tilde {\bm W} in the form

W~LE=(X~TX~)1X~TT\tilde {\bm W}_{LE} = (\tilde {\bm X}^T\tilde {\bm X})^{-1}\tilde {\bm X}^T\bm T

We then obtain the discriminant function in the form

y(\bm x) = \tilde {\bm W}^T_{LE}\tilde {\bm x} = {\bm T}^T \tilde {\bm X}(\tilde {\bm X}^T\tilde {\bm X})^{-1} \tilde {\bm x}

An interesting property of least-squares solutions with multiple target variables is that if every target vector in the training set satisfies some linear constraint \bm a^T\bm t_n + b = 0 for some constants \bm a and b, then the model prediction for any value of \bm x will satisfy the same constraint, so that \bm a^Ty(\bm x) + b = 0. Thus if we use a one-hot vector for K classes, then the predictions made by the model will have the property that the elements of y(\bm x) sum to 1 for any value of \bm x.
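Both the closed-form solution and this sum-to-one property can be checked numerically; the three-blob synthetic dataset below is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
K, N, D = 3, 90, 2

# three well-separated Gaussian blobs, one per class, with one-hot targets
X = np.vstack([rng.normal(loc, 0.5, (N // K, D))
               for loc in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                          # one-hot target matrix

X_tilde = np.hstack([np.ones((N, 1)), X])      # augmented inputs with dummy x0 = 1
# least-squares solution: W = (X~^T X~)^{-1} X~^T T
W = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ T)

Y = X_tilde @ W                                # discriminant values for all inputs
row_sums = Y.sum(axis=1)                       # each row sums to 1 (constraint property)
accuracy = (Y.argmax(axis=1) == labels).mean()
```

Adding a cluster of distant outliers to one class in this sketch shifts the least-squares boundary noticeably, reproducing the lack of robustness discussed next.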

The least-squares approach gives an exact closed-form solution for the discriminant function parameters. However, even as a discriminant function (where we use it to make decisions directly and dispense with any probabilistic interpretation) it suffers from some severe problems. We have already seen that least-squares solutions lack robustness to outliers, and this applies equally to the classification application. The following figure shows that the additional data points far from the cluster produce a significant change in the location of the decision boundary, even though these points would be correctly classified by the original decision boundary. The sum-of-squares error function penalizes predictions that are ‘too correct’ in that they lie a long way on the correct side of the decision.

drawing

However, problems with least squares can be more severe than simply lack of robustness. The following figure shows a synthetic dataset drawn from three classes in a two-dimensional input space (x_1, x_2), having the property that linear decision boundaries can give excellent separation between the classes. It shows the decision boundary found by least squares (magenta curve) and also by the logistic regression model (green curve). Indeed, the technique of logistic regression, described later, gives a satisfactory solution, as seen in the right-hand plot. However, the least-squares solution gives poor results when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.

drawing

The failure of least squares should not surprise us when we recall that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution of the target \bm t \mid \bm x, whereas binary or one-hot target vectors clearly have a distribution that is far from Gaussian. By adopting more appropriate probabilistic models, we shall obtain classification techniques with much better properties than least squares. From a historical point of view, there is another linear discriminant model called the perceptron algorithm. See p. 192 in Pattern Recognition and Machine Learning for more.

Probabilistic Generative Models

Here we shall adopt a generative approach in which we model the class-conditional densities p(xCk)p(\bm x\mid C_k), as well as the class priors p(Ck)p(C_k), and then use these to compute posterior probabilities p(Ckx)p(C_k \mid \bm x) through Bayes’ theorem. First consider the case of two classes. The posterior probability for class C1C_1 can be written as

p(C1x)=p(xC1)p(C1)p(xC1)p(C1)+p(xC2)p(C2)=11+ea=σ(a)\begin{align*} p(C_1\mid x) & = \frac{p(\bm x\mid C_1) p(C_1)}{p(\bm x\mid C_1) p(C_1) + p(\bm x\mid C_2) p(C_2)} \\ & = \frac{1}{1+ e^{-a}} = \sigma(a) \end{align*}

where σ(a)σ(a) is the logistic sigmoid function (the term ‘sigmoid’ means S-shaped) and

a=lnp(xC1)p(C1)p(xC2)p(C2)=lnp(C1x)p(C2x)\begin{align*} a & = \ln \frac{p(\bm x\mid C_1) p(C_1)}{p(\bm x \mid C_2) p(C_2)}\\ & = \ln \frac{p(C_1\mid \bm x)}{p(C_2 \mid \bm x)} \end{align*}

represents the log of the ratio of probabilities for the two classes, also known as the log odds. We shall shortly consider situations in which a(x)a(\bm x) is a linear function of x\bm x, in which case the posterior probability is governed by a generalized linear model. For the case of K>2K > 2 classes, we have

p(Ckx)=p(xCk)p(Ck)jp(xCj)p(Cj)=eakjeaj\begin{align*} p(C_k \mid \bm x) & = \frac{p(\bm x\mid C_k) p(C_k)}{\sum_j p(\bm x \mid C_j) p(C_j)}\\ & = \frac{e^{a_k}}{\sum_j e^{a_j}} \end{align*}

which is known as the normalized exponential and can be regarded as a multiclass generalization of the logistic sigmoid. It is called the softmax function because it represents a smoothed version of the ‘max’ function: if akaja_k ≫ a_j for all jkj \neq k, then p(Ckx)1p(C_k\mid \bm x) \approx 1 and p(Cjx)0p(C_j \mid \bm x) \approx 0. Here the quantities aka_k are defined by

ak=lnp(xCk)p(Ck)a_k = \ln p(\bm x \mid C_k)p(C_k)

which are the unnormalized log probabilities called logits. To extract posterior probabilities from the logits, we exponentiate them and then normalize, which is exactly what softmax does. We now investigate the consequences of choosing specific forms for the class-conditional densities, looking first at continuous input variables x\bm x and then discussing briefly the case of discrete inputs.
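The logit-to-posterior conversion can be sketched in a few lines of NumPy (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: shifting all logits by a constant
    leaves the posteriors unchanged, so subtract the max first."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical unnormalized log probabilities a_k = ln p(x | C_k) p(C_k)
logits = [2.0, 1.0, -1.0]
posterior = softmax(logits)
print(posterior)           # exponentiate, then normalize: entries sum to 1
print(posterior.argmax())  # the largest logit wins
```

Note the "smoothed max" behavior: as one logit dominates the others, its posterior approaches 1.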

Continuous Inputs

Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. Thus the density for class CkC_k is given by

p(xCk)=1(2π)D/21Σ1/2  e12(xµk)TΣ1(xµk)p(\bm x \mid C_k) = \frac{1}{(2π)^{D/2}} \frac{1}{|\Sigma|^{1/2}}\; e^{ -\frac{1}{2}(\bm x− \bm µ_k)^T \bm Σ^{−1}(\bm x− \bm µ_k)}

Consider first the case of two classes:

p(C1x)=σ(wTx+w0)p(C_1 \mid \bm x) = \sigma(\bm w^T \bm x + w_0)

or equivalently,

lnp(C1x)p(C2x)=lnp(xC1)p(C1)p(xC2)p(C2)=wTx+w0\ln \frac{ p(C_1\mid \bm x)}{p(C_2\mid \bm x)} = \ln \frac{p(\bm x\mid C_1) p(C_1)}{p(\bm x \mid C_2) p(C_2)} = \bm w^T \bm x + w_0

Due to the assumption of common covariance matrices, this last equation implies:

w=Σ1(μ1μ2)w0=12μ1TΣ1μ1+12μ2TΣ1μ2+lnp(C1)p(C2)\begin{align*} \bm w & = \Sigma^{-1} (\bm \mu_1 - \bm \mu_2) \\ w_0 & = -\frac{1}{2} \bm \mu_1^T \Sigma^{-1} \bm \mu_1 + \frac{1}{2} \bm \mu_2^T \Sigma^{-1} \bm \mu_2 + \ln \frac{p(C_1)}{p(C_2)} \end{align*}

This result is illustrated for the case of a two-dimensional input space x\bm x in the following figure. The left-hand plot shows the class-conditional densities for two classes, denoted red and blue. On the right is the corresponding posterior probability p(C1x)p(C_1\mid \bm x), which is given by a logistic sigmoid of a linear function of x\bm x. The surface in the right-hand plot is coloured using a proportion of red ink given by p(C1x)p(C_1 \mid \bm x) and a proportion of blue ink given by p(C2x)=1p(C1x)p(C_2\mid \bm x) = 1 − p(C_1 \mid \bm x).

drawing

The decision boundaries correspond to surfaces along which the posterior probabilities p(Ckx)p(C_k \mid \bm x) are constant and so are given by linear functions of x\bm x; the decision boundaries are therefore linear in input space. In the case of two classes, the decision boundary is wTx+w0=0\bm w^T \bm x + w_0 = 0. The prior probabilities p(Ck)p(C_k) enter only through the bias parameter w0w_0, so that changes in the priors have the effect of making parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability. For the general case of K classes we have:
ak(x)=wkTx+wk0=lnp(xCk)p(Ck)a_k(\bm x) = \bm w^T_k \bm x + w_{k0} = \ln p(\bm x \mid C_k)p(C_k)

where:

\begin{align*} \bm w_k & = \Sigma^{-1} \bm \mu_k, \\ w_{k0} & = -\frac{1}{2} \bm \mu_k^T \Sigma^{-1} \bm \mu_k + \ln p(C_k). \end{align*}

The resulting decision boundaries, corresponding to the minimum misclassification rate, occur where the two largest posterior probabilities are equal, ak(x)=aj(x)a_k(\bm x) = a_j(\bm x), and so are defined by linear functions of x\bm x; again we have a generalized linear model, called linear discriminant analysis (LDA). The KK centroids in pp-dimensional input space lie in an affine subspace of dimension ≤ K−1, and if pp is much larger than KK, projecting the input space into this subspace gives a considerable drop in dimension. Moreover, in locating the closest centroid for a given x\bm x, we can ignore distances orthogonal to this subspace, since they contribute equally to each class. Thus we might just as well project x\bm x onto this centroid-spanning subspace HK1H_{K−1} and make distance comparisons there. There is thus a fundamental dimension reduction in LDA: we need only consider the data in a subspace of dimension at most K1K−1. If K=3K = 3, for instance, this could allow us to view the data in a two-dimensional plot, color-coding the classes.
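Under the shared-covariance Gaussian assumption, the discriminants a_k(x) = w_k^T x + w_{k0} can be computed directly from the class means and priors. A minimal sketch (the means, covariance, and priors below are invented for illustration):

```python
import numpy as np

def lda_params(mus, Sigma, priors):
    """Linear discriminant parameters a_k(x) = w_k^T x + w_k0 for a
    shared covariance Sigma (illustrative sketch, not a library API)."""
    Sinv = np.linalg.inv(Sigma)
    W = np.array([Sinv @ mu for mu in mus])            # rows are w_k
    b = np.array([-0.5 * mu @ Sinv @ mu + np.log(p)
                  for mu, p in zip(mus, priors)])      # the w_k0 terms
    return W, b

def lda_predict(x, W, b):
    return int(np.argmax(W @ x + b))   # pick the largest discriminant

# Two hypothetical Gaussian classes with equal priors
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
W, b = lda_params(mus, np.eye(2), [0.5, 0.5])
print(lda_predict(np.array([0.1, -0.2]), W, b))  # near the first mean -> 0
print(lda_predict(np.array([2.1, 1.8]), W, b))   # near the second mean -> 1
```

With an identity covariance and equal priors, this reduces to nearest-mean classification, as the geometry above suggests.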

If we relax the assumption of a shared covariance matrix and allow each class-conditional density p(xCk)p(\bm x \mid C_k) to have its own covariance matrix ΣkΣ_k, then the earlier cancellations will no longer occur, and we will obtain quadratic functions of x\bm x, giving rise to a quadratic discriminant. If we make the further assumption of independence of features conditioned on classes, we get Naive Bayes:

p(xCk)=j=1Dp(xjCk)p(\bm x \mid C_k) = \prod_{j=1}^D p(x_j \mid C_k)

The probabilities p(xjCk)p(x_j \mid C_k) could be modeled as:

Maximum likelihood estimates of the Naive Bayes parameters can be easily computed from empirical data. Up to an additive constant that does not depend on the class, the log posterior is:

logp(Ckx)=logp(Ck)+jlogp(xjCk)\log p(C_k \mid x) = \log p(C_k) + \sum_j \log p(x_j \mid C_k)

From training data, estimate:

If a feature never appears in the training data for a class, its estimated conditional probability is zero and the whole product vanishes; Laplace smoothing avoids this by adding a small pseudo-count to every feature–class combination. We predict the category by performing inference in the model using Bayes’ Rule:

p(Ckx)=p(xCk)p(Ck)kp(xCk)p(Ck)=p(Ck)jp(xjCk)kp(Ck)jp(xjCk)\begin{align*} p(C_k\mid \bm x) & = \frac{p(\bm x\mid C_k)p(C_k)}{\sum_k p(\bm x\mid C_k)p(C_k)}\\ & = \frac{p(C_k) \prod_j p(x_j\mid C_k)}{\sum_k p(C_k)\prod_j p(x_j\mid C_k)} \end{align*}

We need not compute the denominator if we’re simply trying to determine the most likely class. Naive Bayes works surprisingly well but can perform poorly when features are correlated. It scales to very high-dimensional data, and its decision boundaries are linear in log-space. It is used for Text Classification (spam detection, sentiment) or as a quick baseline model for many tasks.
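A Bernoulli Naive Bayes classifier can be fit by counting, with Laplace smoothing, in a few lines. The tiny dataset below is invented for illustration; this is a sketch, not a library implementation:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate priors and per-feature Bernoulli parameters by counting,
    with Laplace smoothing alpha (illustrative sketch)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    theta = np.array([(X[y == c].sum(axis=0) + alpha) /
                      ((y == c).sum() + 2 * alpha) for c in classes])
    return classes, priors, theta

def predict_nb(x, classes, priors, theta):
    # log p(C_k) + sum_j log p(x_j | C_k); the denominator of Bayes' rule
    # is not needed to find the most likely class
    log_post = (np.log(priors)
                + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1))
    return classes[np.argmax(log_post)]

# Tiny invented binary dataset: 4 examples, 3 features
X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([0, 0, 1, 1])
classes, priors, theta = fit_bernoulli_nb(X, y)
print(predict_nb(np.array([1, 0, 0]), classes, priors, theta))  # -> 0
```

The smoothing constant alpha keeps every estimated probability strictly between 0 and 1, so no single unseen feature can zero out a class.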

Maximum Likelihood Solution for LDA

Once we have specified a parametric functional form for the class-conditional densities p(xCk)p(\bm x \mid C_k), we can then determine the values of the parameters, together with the prior class probabilities p(Ck)p(C_k), using maximum likelihood. This requires a dataset comprising observations of x\bm x along with their corresponding class labels.

Consider first the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a dataset {xn,tn}\{x_n, t_n \} where n=1,...,Nn = 1, . . . , N. Here tn=1t_n = 1 denotes class C1C_1 and tn=0t_n = 0 denotes class C2C_2. We denote the prior class probability p(C1)=πp(C_1) = π, so that p(C2)=1πp(C_2) = 1− π. For example, for a data point xnx_n from class C1C_1, we have:

p(xn,C1)=p(C1)p(xnC1)=πN(xnμ1,Σ)p(\bm x_n, C_1) = p(C_1)p(\bm x_n \mid C_1) = \pi \mathcal N(\bm x_n \mid \mu_1, \Sigma)

Thus the likelihood function is given by:

p(tπ,μ1,μ2,Σ)=n=1N[πN(xnμ1,Σ)]tn[(1π)N(xnμ2,Σ)]1tnp(t \mid π, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N [π \mathcal N (\bm x_n \mid \bm \mu_1, Σ)]^{t_n} [(1− π)\mathcal N (\bm x_n \mid \bm \mu_2, Σ)]^{1−t_n}

where t=(t1,,tN)T\bm t = (t_1, \dots, t_N)^T. Setting the derivative of the log likelihood with respect to ππ equal to zero and rearranging, we obtain:

π=1Nn=1Ntn=N1N=N1N1+N2π = \frac{1}{N}\sum_{n=1}^N t_n = \frac{N_1}{N} = \frac{N_1}{N_1+N_2}

Thus the maximum likelihood estimate for ππ is simply the fraction of points in class C1C_1, as expected. This result is easily generalized to the multiclass case, where again the maximum likelihood estimate of the prior probability associated with each class is the fraction of training points assigned to that class. Setting the derivative of the log likelihood with respect to μ1\mu_1 to zero and rearranging, we obtain

μ1=1N1n=1Ntnxn\mu_1 = \frac{1}{N_1}\sum_{n=1}^N t_n \bm x_n

which is simply the mean of all the input vectors xn\bm x_n assigned to class C1C_1. The result for μ2\mu_2 is analogous. The maximum likelihood solution for the shared covariance matrix Σ\bm \Sigma is

Σ=N1NS1+N2NS2,S1=1N1nC1(xnμ1)(xnμ1)T,S2=1N2nC2(xnμ2)(xnμ2)T,\begin{align*} \bm \Sigma & = \frac{N_1}{N}\bm S_1 + \frac{N_2}{N}\bm S_2, \\ \bm S_1 &= \frac{1}{N_1} \sum_{n\in C_1} (\bm x_n - \bm \mu_1)(\bm x_n - \bm \mu_1)^T,\\ \bm S_2 &= \frac{1}{N_2} \sum_{n\in C_2} (\bm x_n - \bm \mu_2)(\bm x_n - \bm \mu_2)^T, \end{align*}

which represents a weighted average of the covariance matrices associated with each of the two classes separately. This result is easily extended to the KK class problem to obtain the corresponding maximum likelihood solutions for the parameters in which each class-conditional density is Gaussian with a shared covariance matrix. Note that the approach of fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust, being equivalent to minimizing a sum-of-squares error.
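The closed-form estimates above translate directly into code. A sketch on synthetic two-class Gaussian data (the class means and random seed are arbitrary illustrative choices):

```python
import numpy as np

def fit_lda_mle(X, t):
    """Closed-form ML estimates for two Gaussian classes with a shared
    covariance; t_n = 1 denotes C_1 and t_n = 0 denotes C_2 (a sketch)."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)                      # fraction of points in C_1
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)  # weighted average of class covariances
    return pi, mu1, mu2, Sigma

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([3.0, 3.0], 1.0, (50, 2))])
t = np.concatenate([np.ones(50), np.zeros(50)])
pi, mu1, mu2, Sigma = fit_lda_mle(X, t)
print(pi)  # 0.5
```

With enough samples the estimates recover the generating parameters: means near the true class means and a covariance near the identity.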

Regularized Discriminant Analysis (RDA)

Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. These methods are very similar in flavor to ridge regression. The regularized covariance matrices have the form
Σ^k(RDA)=αΣ^k+(1α)Σ^+γI\hat \Sigma_k^{(\text{RDA})} = \alpha \hat \Sigma_k+ (1-\alpha) \hat \Sigma + \gamma \bm I

where Σ^\hat \Sigma is the pooled covariance matrix as used in LDA and the Σ^k\hat \Sigma_k are the class-specific covariance matrices, such as S1S_1 defined above. Here α[0,1]α ∈[0,1] allows a continuum of models between LDA and QDA, and needs to be specified. The hyperparameter γ0γ ≥ 0 adds a scaled identity matrix (ridge regularization) to stabilize covariance estimates and helps especially when the number of features is large compared to the number of samples. In practice α,γα, \gamma can be chosen based on the performance of the model on validation data, or by cross-validation.
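The regularized covariance is a two-parameter interpolation and is essentially one line of code. A direct sketch of the formula (the matrices below are placeholders):

```python
import numpy as np

def rda_covariance(S_k, S_pooled, alpha, gamma):
    """alpha interpolates between LDA (alpha=0) and QDA (alpha=1);
    gamma adds a ridge term to stabilize the estimate (direct sketch)."""
    return alpha * S_k + (1 - alpha) * S_pooled + gamma * np.eye(len(S_pooled))

S_k = np.array([[2.0, 0.5], [0.5, 1.0]])  # hypothetical class covariance
S_pooled = np.eye(2)                       # hypothetical pooled covariance
print(rda_covariance(S_k, S_pooled, alpha=0.0, gamma=0.0))  # recovers S_pooled
print(rda_covariance(S_k, S_pooled, alpha=1.0, gamma=0.0))  # recovers S_k
```

In a model-selection loop, one would fit this covariance per class and score (alpha, gamma) pairs on held-out data.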

Discrete Features

Let us now consider the case of discrete feature values xix_i. For simplicity, we begin by looking at binary feature values xi{0,1}x_i \in \{0, 1 \} and discuss the extension to more general discrete features shortly. If there are D features, then a general distribution would correspond to a table of 2D2^D numbers for each class, containing 2D12^D− 1 independent variables (due to the summation constraint). Because this grows exponentially with the number of features, we might seek a more restricted representation. Here we will make the Naive Bayes assumption in which the feature values are treated as independent, conditioned on the class CkC_k. Thus we have class-conditional distributions of the form

p(\bm x \mid C_k) = \prod_{i=1}^D \mu_{ki}^{x_i} (1- \mu_{ki})^{1-x_i}

which contain D independent parameters for each class. This implies that:

ak(x)=lnp(Ck)+i=1D(xilnμki+(1xi)ln(1μki))a_k(\bm x) = \ln p(C_k) + \sum_{i=1}^D (x_i\ln \mu_{ki} + (1-x_i) \ln(1-\mu_{ki}))

which again are linear functions of the input features xix_i. Analogous results are obtained for discrete variables each of which can take M > 2 states. For both Gaussian distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid (K = 2 classes) or softmax (K > 2 classes) activation functions. These are particular cases of a more general result obtained by assuming that the class-conditional densities p(xCk)p(\bm x|C_k) are members of the exponential family of distributions. Many techniques are based on models for the class densities:

Probabilistic Discriminative Models

For the two-class classification problem, we have seen that the posterior probability of class C1C_1 can be written as a logistic sigmoid acting on a linear function of xx, for a wide choice of class-conditional distributions p(xCk)p(\bm x \mid C_k). Similarly, for the multiclass case, the posterior probability of class CkC_k is given by a softmax transformation of a linear function of x\bm x. For specific choices of the class-conditional densities p(xCk)p(\bm x\mid C_k), we have used maximum likelihood to determine the parameters of the densities as well as the class priors p(Ck)p(C_k) and then used Bayes’ theorem to find the posterior class probabilities.

However, an alternative approach is to use the functional form of the generalized linear model explicitly and to determine its parameters directly by using maximum likelihood. The indirect approach to finding the parameters of a generalized linear model, by fitting class-conditional densities and class priors separately and then applying Bayes’ theorem, represents an example of generative modeling, because we could take such a model and generate synthetic data by drawing values of x\bm x from the marginal distribution p(x)p(\bm x). In the direct approach, we are maximizing a likelihood function defined through the conditional distribution p(Ckx)p(C_k \mid \bm x), which represents a form of discriminative training. One advantage of the discriminative approach is that there will typically be fewer adaptive parameters to be determined.

Logistic Regression

The posterior probability of class C1C_1 can be written as a logistic sigmoid acting on a linear function of the feature vector ϕ\phi so that

p(C1ϕ(x))=y(ϕ(x))=σ(wTϕ(x))p(C_1 \mid \phi(\bm x)) = y(\phi(\bm x)) = \sigma(\bm w^T \phi(\bm x))

For an M-dimensional feature space ϕ\phi, this model has M adjustable parameters. By contrast, if we had fitted Gaussian class conditional densities using maximum likelihood, we would have used 2M parameters for the means and M(M+1)/2M(M + 1)/2 parameters for the (shared) covariance matrix. For a dataset {ϕn,tn}\{\phi_n, t_n\}, where tn{0,1}t_n ∈ \{0, 1\} and ϕn=ϕ(xn)\phi_n= \phi(x_n) with n=1,...,Nn=1, . . . , N, the likelihood function can be written:

p(tw)=n=1Nyntn(1yn)1tnp(\bm t \mid \bm w) = \prod_{n=1}^N y_n^{t_n} (1-y_n )^{1-t_n}

where t=(t1,...,tN)T\bm t = (t_1, . . . , t_N )^T and yn=p(C1ϕn)=σ(wTϕn)y_n = p(C_1 \mid \phi_n) = \sigma(\bm w^T \phi_n). As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form

E(w)=lnp(tw)=n=1N(tnlnyn+(1tn)ln(1yn))E(\bm w) =− \ln p(\bm t\mid \bm w) =− \sum_{n=1}^N (t_n \ln y_n + (1− t_n) \ln(1− y_n) )

Taking the gradient of the error function with respect to w\bm w, we obtain

wE(w)=n=1N(yntn)ϕn∇_wE(\bm w) = \sum_{n=1}^N (y_n− t_n)\phi_n

It is worth noting that maximum likelihood can exhibit severe overfitting for datasets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to σ=0.5σ = 0.5, equivalent to wTϕ=0\bm w^T \phi=0, separates the two classes and the magnitude of w\bm w goes to infinity to maximize the likelihood. In this case, the logistic sigmoid function becomes infinitely steep in feature space, corresponding to a Heaviside step function, so that every training point from each class kk is assigned a posterior probability p(Ckx)=1p(C_k|\bm x) = 1, which is a severe form of overfitting that generalizes poorly. Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training dataset is linearly separable. The singularity can be avoided by inclusion of a prior and finding a MAP solution for w\bm w, or equivalently by adding a regularization term to the error function. In general, MLE will try to maximize the likelihood at all costs, even if:

In logistic regression, MLE can overfit because it has no mechanism to limit model complexity. In high dimensions or noisy data, it may assign extreme weights to maximize likelihood, leading to poor generalization. Regularization (e.g., L2) helps prevent this by penalizing large weights.
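A minimal sketch of L2-regularized logistic regression trained by batch gradient descent, using the gradient derived above (the toy data, learning rate, and penalty strength are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lam=0.1, lr=0.1, n_iter=2000):
    """Batch gradient descent on the cross-entropy error plus an L2
    penalty; the penalty keeps ||w|| finite even on separable data."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + lam * w   # sum_n (y_n - t_n) phi_n + lam * w
        w -= lr * grad / len(t)
    return w

# Linearly separable toy data (first column is a bias feature): with
# lam = 0, maximum likelihood would drive ||w|| toward infinity.
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(Phi, t)
print(sigmoid(Phi @ w))  # below 0.5 for class 0 points, above for class 1
```

Even though the data are separable, the regularizer stops the weights from diverging, so the predicted probabilities stay away from hard 0/1 values.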

| Question | Naive Bayes | Logistic Regression |
| --- | --- | --- |
| Probabilistic? | ✅ Yes | ✅ Yes |
| Linear? | ✅ Yes (in log-space) | ✅ Yes |
| Learns from data? | Estimates from counts | Learns optimal weights |
| Assumes independence? | ❗ Yes | ❌ No |
| Fast to train? | ✅ Extremely | ❌ Slower |

Combining Models

One form of model combination is to select one of the models to make the prediction, depending on the input variables. Thus different models become responsible for making predictions in different regions of input space. One widely used framework of this kind is known as a decision tree, in which the selection process can be described as a sequence of binary selections corresponding to the traversal of a tree structure. In this case, the individual models are generally chosen to be very simple, and the overall flexibility of the model arises from the input-dependent selection process. Decision trees can be applied to both classification and regression problems. One limitation of decision trees is that the division of input space is based on hard splits in which only one model is responsible for making predictions for any given value of the input variables. The decision process can be softened by moving to a probabilistic framework for combining models, as in Gaussian Mixture Models. Such models can be viewed as mixture distributions in which the component densities, as well as the mixing coefficients, are conditioned on the input variables and are known as mixtures of experts.

An ensemble of models is a set of models whose individual decisions are combined in some way to create a new model. For this to be nontrivial, the models must differ somehow, e.g.

In fact, ensemble methods work best when the submodels are as independent from one another as possible. One way to get diverse models is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.

Tree-Based Methods

Tree-based methods partition the feature space into a set of rectangles whose edges are aligned with the axes and then assign a simple model (for example, a constant) to each region. This process repeats recursively until the desired performance is reached. A well-known tree algorithm is the Decision Tree. Decision Trees make very few assumptions about the training data (as opposed to linear models, which obviously assume that the data is linear, for example). They can be viewed as a model combination method in which only one model is responsible for making predictions at any given point in input space. The process of selecting a specific model, given a new input x\bm x, can be described by a sequential decision making process corresponding to the traversal of a binary tree (one that splits into two branches at each node). At each internal node, a splitting variable and its threshold are chosen; branching is determined by comparing against the threshold, and leaf nodes hold the outputs (predictions). Each path from root to a leaf defines a region RmR_m of input space. The following figure illustrates a recursive binary partitioning of the input space, along with the corresponding tree structure.

drawing

The first step divides the whole of the input space into two regions according to whether x1θ1x_1 \leq θ_1 or x1>θ1x_1 > θ_1 where θ1θ_1 is a parameter of the model. This creates two subregions, each of which can then be subdivided independently. For instance, the region x1θ1x_1 \leq θ_1 is further subdivided according to whether x2θ2x_2 \leq θ_2 or x2>θ2x_2 > θ_2, giving rise to the regions denoted AA and BB. The recursive subdivision can be described by the traversal of the binary tree shown below:

drawing

For any new input x\bm x, we determine which region it falls into by starting at the top of the tree at the root node and following a path down to a specific leaf node according to the decision criteria at each node. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it. Such a model is often called a nonparametric model, not because it does not have any parameters (it often has a lot) but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom during training. As you know by now, this is called regularization. Reducing max_depth will regularize the model and thus reduce the risk of overfitting. Other parameters include min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances), max_leaf_nodes (maximum number of leaf nodes), and max_features (maximum number of features that are evaluated for splitting at each node). Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.
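The effect of these hyperparameters can be seen by fitting an unconstrained and a constrained tree on the same data (assumes Scikit-Learn; the dataset and settings are illustrative):

```python
# Regularizing a tree via max_depth / min_samples_leaf, using Scikit-Learn.
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

deep = DecisionTreeClassifier(random_state=42).fit(X, y)        # unconstrained
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=42).fit(X, y)     # regularized

# The unconstrained tree adapts itself to the training data and fits it
# perfectly; the regularized tree is forced to stay simple.
print(deep.get_depth(), deep.score(X, y))
print(shallow.get_depth(), shallow.score(X, y))
```

The unconstrained tree's perfect training accuracy is exactly the overfitting symptom described above; on a held-out set the shallow tree typically generalizes better.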

Decision tree training follows a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the global optimal solution. Unfortunately, finding the optimal tree is known to be an NP-Complete problem: it requires O(exp(m))\mathcal O(\exp(m)) time, making the problem intractable even for fairly small training sets. This is why we must settle for a “reasonably good” solution.

Note: P is the set of problems that can be solved in polynomial time. NP is the set of problems whose solutions can be verified in polynomial time. An NP-Hard problem is a problem to which any NP problem can be reduced in polynomial time. An NP-Complete problem is both NP and NP-Hard. A major open mathematical question is whether or not P = NP. If P ≠ NP (which seems likely), then no polynomial algorithm will ever be found for any NP-Complete problem (except perhaps on a quantum computer).


A Decision Tree can also estimate the probability that an instance belongs to a particular class kk: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class kk in this node. Decision trees are not probabilistic graphical models.

Regression Trees

Suppose our data consists of pp inputs and a response, for each of NN observations: that is, (xi,yi)(x_i,y_i) for i=1,2,...,Ni = 1,2,...,N, with xi=(xi1,xi2,,xip)x_i = (x_{i1},x_{i2},\dots,x_{ip}). The algorithm needs to automatically decide on the splitting variables and split points, and also what topology (shape) the tree should have. Learning the simplest (smallest) decision tree is an NP-complete problem, so we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable jj and split point ss, and define the pair of half-planes R1(j,s)={XXjs}R_1(j,s) = \{ X \mid X_j ≤s \} and R2(j,s)={XXj>s}R_2(j,s) = \{ X \mid X_j >s \}. Then we seek the splitting variable jj and split point ss that solve

minj,s(minc1xiR1(j,s)(yic1)2+minc2xiR2(j,s)(yic2)2)\min_{j, s} \Big( \min_{c_1} \sum_{x_i ∈R_1(j,s)} (y_i−c_1)^2 + \min_{c_2} \sum_{x_i ∈R_2(j,s)} (y_i−c_2)^2 \Big)

if we adopt the sum of squares as a measure for distance. For any choice jj and ss, the inner minimization is solved by
c^1=ave(yixiR1(j,s)),c^2=ave(yixiR2(j,s)).\hat c_1 = \text{ave}(y_i \mid x_i ∈ R_1(j,s)),\\ \hat c_2 = \text{ave}(y_i\mid x_i ∈R_2(j,s)).

For each splitting variable, the determination of the split point ss can be done very quickly, and hence by scanning through all of the inputs, determination of the best pair (j,s)(j,s) is feasible. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. Then this process is repeated on all of the resulting regions. If we have a partition into MM regions R1,R2,,RMR_1,R_2,\dots,R_M , and we model the response as a constant cmc_m in each region: f(x)=m=1McmI(xRm)f(x) = \sum_{m=1}^M c_m I(x∈R_m).
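The inner split-point scan for a single input variable can be sketched as follows (toy step-function data chosen so the best split is obvious):

```python
import numpy as np

def best_split(x, y):
    """Scan candidate split points on one input variable and return the s
    minimizing the two-sided residual sum of squares (inner-loop sketch)."""
    best_s, best_cost = None, np.inf
    for s in np.unique(x)[:-1]:            # split between consecutive values
        left, right = y[x <= s], y[x > s]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s, best_cost

# Toy data with a step at x = 0.5, so the best split should be at s = 0.4
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0])
print(best_split(x, y))
```

Scanning this over every input variable yields the best pair of splitting variable and split point; the process is then repeated within each resulting region.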

Classification Trees

If the target is a classification outcome taking values 1,2,...,K, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree. For regression we used the squared-error node impurity, but this is not suitable for classification. In a node mm, representing a region RmR_m with NmN_m observations, let

p^mk=1NmxiRmI(yi=k)\hat p_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} I(y_i = k)

be the proportion of observations in class kk in node mm. Then we classify the observations in node mm to class
k(m)=arg maxkp^mkk(m) = \argmax_k \hat p_{mk}

the majority class in node mm. Different measures of node impurity include the following:

\begin{align*} \text{Misclassification error}&: 1 - \hat p_{mk(m)}\\ \text{Gini index}&: \sum_{k=1}^K \hat p_{mk}(1 - \hat p_{mk}) \\ \text{Cross-entropy}&: - \sum_{k=1}^K \hat p_{mk}\log\hat p_{mk} \end{align*}

All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization.
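The three impurity measures are easy to compute from the class proportions of a node; a short sketch:

```python
import numpy as np

def impurities(p):
    """Misclassification error, Gini index, and cross-entropy for a node
    with class proportions p (zero proportions contribute 0 to entropy)."""
    p = np.asarray(p, dtype=float)
    misclass = 1.0 - p.max()                        # 1 - p_{m,k(m)}
    gini = (p * (1.0 - p)).sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    return misclass, gini, entropy

print(impurities([1.0, 0.0]))  # pure node: all three measures are zero
print(impurities([0.5, 0.5]))  # maximally impure two-class node
```

All three vanish on a pure node and peak at uniform class proportions, which is why any of them can drive the split search.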

Advantages of decision trees over KNN
Advantages of KNN over decision trees

One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it. As we’ll see next lecture, ensembles of decision trees are much stronger at the cost of losing interpretability.

Bagging

The simplest way to construct an ensemble is to average the predictions of a set of individual independent models. A common approach is to use the same training algorithm for every model but to train them on different random subsets of the training set. Since we have only a single dataset, this is how we introduce variability between the different models within the committee. One approach is to use bootstrap datasets. Consider a regression problem in which we are trying to predict the value of a single continuous variable, and suppose we generate MM bootstrap datasets and then use each to train a separate copy ym(x)y_m(\bm x) of a model where m=1,...,Mm = 1, . . . , M. The committee prediction is given by

yCOM=1Mm=1Mym(x)y_{\text{COM}} = \frac{1}{M}\sum_{m=1}^M y_m(\bm x)

This procedure is known as bootstrap aggregation or bagging. Suppose the true regression function that we are trying to predict is given by h(x)h(x), so that the output of each of the models can be written as the true value plus an error in the form ym(x)=h(x)+ϵm(x)y_m(\bm x) = h(\bm x) + ϵ_m(\bm x), where we assume E[ϵm(x)]=0\mathbb E [ϵ_m(\bm x)] = 0. How does this affect the three terms of the expected loss in the bias-variance equation?

This apparently dramatic result suggests that the average error of a model can be reduced by a factor of MM simply by averaging MM versions of the model. Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated. In practice, the errors are typically highly correlated, and the reduction in overall error is generally small. It can, however, be shown that the expected committee error will not exceed the expected error of the constituent models, so that Var[yCOM]Var[ym(x)]\text{Var} [y_{\text{COM}}] \leq \text{Var} [y_m(\bm x)]. Ironically, it can be advantageous to introduce additional variability into your algorithm, as long as it reduces the correlation between sampled predictions. It can help to average over multiple algorithms, or multiple configurations of the same algorithm. That is why random forests exist.
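Bootstrap aggregation itself is algorithm-agnostic; a generic sketch with a deliberately weak base learner (the constant predictor and dataset are illustrative placeholders):

```python
import numpy as np

def bagged_predict(X_train, y_train, X_test, fit, M=25, seed=0):
    """Train M copies of the base learner on bootstrap resamples and
    average their predictions (generic regression sketch)."""
    rng = np.random.default_rng(seed)
    N = len(X_train)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)   # sample N points with replacement
        model = fit(X_train[idx], y_train[idx])
        preds.append(model(X_test))
    return np.mean(preds, axis=0)          # the committee average y_COM

# Deliberately weak base learner: predict the training mean everywhere
def fit_mean(X, y):
    m = y.mean()
    return lambda X_test: np.full(len(X_test), m)

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
print(bagged_predict(X, y, X, fit_mean))
```

Any `fit` callable returning a predictor can be dropped in, which is how bagging is combined with decision trees in practice.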

Random Forest

If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method in Scikit-Learn), then you can predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression.

The Random Forest (RF) algorithm is a bagging algorithm of decision trees, with one extra trick to decorrelate the predictions: when choosing the best feature to split each node of the decision tree, choose it from a random subset of dd input features, and only consider splits on those features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model. Random forests often work well with no tuning whatsoever. In short,

Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average across all trees in the forest. More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it. Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
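Feature importances fall out of a fitted forest directly in Scikit-Learn. A sketch on synthetic data where the label depends on only one of two features (the dataset construction is my own illustration):

```python
# Assumes Scikit-Learn; the synthetic dataset is built so that the label
# depends only on the first of the two features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
informative = rng.normal(size=500)
noise = rng.normal(size=500)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)   # label determined by feature 0 only

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)      # impurity-based importances, summing to 1
```

As expected, nearly all of the importance mass lands on the informative feature, which is the behavior that makes this useful for quick feature selection.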

Boosting

Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train models sequentially, each trying to correct its predecessor. AdaBoost (short for Adaptive Boosting) and Gradient Boosting are popular ones. Boosting can achieve low bias but sensitive to overfitting without regularization. Boosting can give good results even if the base models have a performance that is only slightly better than random, and hence sometimes the base models are known as weak learners. Weak learner is a learning algorithm that outputs a hypothesis (e.g., a classifier) that performs slightly better than chance, e.g., it predicts the correct label with probability 0.6. We are interested in weak learners that are computationally efficient.

AdaBoost improves models sequentially by increasing the weights of examples misclassified by the previous model, which raises their contribution to the loss that the next model must reduce. Given a vector of predictor variables XX, a classifier G(X)G(X) produces a prediction taking one of the two values {1,1}\{−1,1\}. The error rate on the training sample is

errm=i=1NwiI(yiG(xi))iwi.\text{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G(x_i))}{\sum_i w_i}.

The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers Gm(x)G_m(x), m=1,2,...,Mm = 1,2,...,M. The data modifications at each boosting step consist of applying weights w1,w2,,wNw_1,w_2,\dots,w_N to each of the training observations (xi,yi)(x_i,y_i), i=1,2,...,Ni= 1,2,...,N. Initially all of the weights are set to wi=1/Nw_i = 1/N, so that the first step simply trains the classifier on the data in the usual manner. For each successive iteration m=2,3,...,Mm = 2,3,...,M the observation weights are individually modified and the classification algorithm is reapplied to the weighted observations:

αm=log((1errm)/errm),wi(m+1)wi(m)eαmI(yiGm(xi))\begin{align*} \alpha_m & = \log\big((1- \text{err}_m)/ \text{err}_m\big), \\ w_i^{(m+1)} & \leftarrow w_i^{(m)} e^{\alpha_m I(y_i \neq G_m(x_i))} \end{align*}

At step mm, those observations that were misclassified by the previous classifier Gm1(x)G_{m−1}(x) have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus as iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence. Each weak learner is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence to reduce the weighted error because misclassifying a high-weight example hurts more than misclassifying a low-weight one. In AdaBoost, input weights tell the new learner which samples to focus on — they affect the loss function during training. The learner is trained to do well on the weighted data distribution, not uniformly across the dataset. Make predictions using the final model, which is given by

YM(x)=sign(m=1Mαmym(x))Y_M (\bm x) = \text{sign} \Big( \sum_{m=1}^M \alpha_m y_m(\bm x) \Big)

Key steps of AdaBoost:

AdaBoost reduces bias by making each classifier focus on previous mistakes. Friedman et al. (2000) gave a different and very simple interpretation of boosting in terms of the sequential minimization of an exponential error function. See page 659 of The Elements of Statistical Learning.
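The reweighting rules above can be sketched end to end with decision stumps as weak learners. This is a minimal illustration under stated assumptions (the toy dataset, the exhaustive threshold search, and the helper names are made up for the example), not the exact variant implemented in libraries:

```python
import numpy as np

def stump_predict(x, threshold, polarity):
    """Weak learner: predict +1 on one side of the threshold, -1 on the other."""
    return polarity * np.where(x > threshold, 1, -1)

def adaboost_fit(x, t, M=3):
    """Train M stumps with the AdaBoost reweighting rules from the text."""
    N = len(x)
    w = np.full(N, 1.0 / N)                 # initial weights w_i = 1/N
    stumps, alphas = [], []
    for _ in range(M):
        # pick the stump (threshold, polarity) with the smallest weighted error
        best = None
        for threshold in x:
            for polarity in (1, -1):
                pred = stump_predict(x, threshold, polarity)
                err = w[pred != t].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err_m, threshold, polarity = best
        err_m = min(max(err_m, 1e-12), 1 - 1e-12)   # guard against log(0)
        alpha_m = np.log((1 - err_m) / err_m)       # alpha_m = log((1-err)/err)
        pred = stump_predict(x, threshold, polarity)
        w = w * np.exp(alpha_m * (pred != t))       # upweight misclassified points
        stumps.append((threshold, polarity))
        alphas.append(alpha_m)
    return stumps, alphas

def adaboost_predict(x, stumps, alphas):
    """Final model: sign of the alpha-weighted sum of stump predictions."""
    scores = sum(a * stump_predict(x, th, pol)
                 for (th, pol), a in zip(stumps, alphas))
    return np.sign(scores)

# Toy 1-D dataset that no single stump can classify perfectly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
t = np.array([1, 1, -1, -1, 1, 1])
stumps, alphas = adaboost_fit(x, t, M=3)
train_acc = (adaboost_predict(x, stumps, alphas) == t).mean()
```

After three rounds the weighted combination of stumps fits the pattern that no individual stump can represent, illustrating how boosting reduces bias.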

Gradient Boosting Trees (XGBoost)

Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor. Boosted trees also have extra parameters to induce more variability among predictors, such as the subsample hyperparameter, which specifies the fraction of training instances used to train each tree. For example, if subsample=0.25, each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this trades a higher bias for a lower variance. An optimized implementation of Gradient Boosting is available in the popular Python library XGBoost, which stands for Extreme Gradient Boosting. XGBoost aims to be extremely fast, scalable, and portable.

XGBoost is a boosting algorithm in which each training step adds one entirely new tree built from scratch, so that at step tt the ensemble contains K=tK=t trees. Mathematically, we can write our model in the form
y^i=k=1Kfk(xi)\hat y_i = \sum_{k=1}^K f_k(x_i)

where each function fkf_k encodes the structure of a tree and its leaf scores. It is intractable to learn all the trees at once. Instead, we use an additive strategy: fix what has been learned, and add one new tree at a time. Which tree do we want at each step? We add the one that optimizes our objective!

obj(t)=i=1nl(yi,y^i(t))+k=1tω(fk)=i=1nl(yi,y^i(t1)+ft(xi))+ω(ft)+const.i=1n(l(yi,y^i(t1))+gift(xi)+12hift(xi)2)+ω(ft)+const.=i=1n(gift(xi)+12hift(xi)2)+ω(ft)+const.\begin{align*} \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat y_i^{(t)}) + \sum_{k=1}^t \omega(f_k) \\ & = \sum_{i=1}^n l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) + \omega(f_t) + \text{const.} \\ & \approx \sum_{i=1}^n \Big( l(y_i, \hat y_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2}h_i f_t(x_i)^2 \Big) + \omega(f_t) + \text{const.} \\ & = \sum_{i=1}^n \Big( g_i f_t(x_i) + \frac{1}{2}h_i f_t(x_i)^2 \Big) + \omega(f_t) + \text{const.} \end{align*}

where ll is our loss function and

gi=yl(yi,y)y=y^i(t1),hi=2y2l(yi,y)y=y^i(t1),\begin{align*} g_i & = \frac{\partial}{\partial y} l(y_i,y) |_{y=\hat y_i^{(t-1)}}, \\ h_i & = \frac{\partial^2}{\partial y^2} l(y_i,y) |_{y=\hat y_i^{(t-1)}}, \end{align*}

and where ω(fk)\omega(f_k) is the complexity of the tree fkf_k, defined in detail later. The third line is the second-order Taylor expansion of the loss function ll, which is what XGBoost uses. After removing constants, the objective approximately becomes:

i=1n(gift(xi)+12hift(xi)2)+ω(ft)\sum_{i=1}^n \Big( g_i f_t(x_i) + \frac{1}{2}h_i f_t(x_i)^2 \Big) + \omega(f_t)

which should be minimized for the new tree. One important advantage of this definition is that, as far as the loss function is concerned, the value of the objective depends only on gig_i and hih_i. This is how XGBoost supports custom loss functions: we can optimize every loss function, including logistic regression and pairwise ranking, using exactly the same solver that takes gig_i and hih_i as input! The value ft(x)f_t(x) is the score of the leaf that input xx falls into in tree tt. Let wRT\bm w \in \mathbb R^T be the vector of scores on the leaves of tree tt, where TT is the number of leaves. In XGBoost, we define:
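Since the solver only needs gig_i and hih_i, plugging in a new loss amounts to supplying its first two derivatives. A sketch of this idea (helper names are illustrative; XGBoost's custom-objective API similarly expects a function returning gradient and hessian arrays):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_hess(y_true, y_pred_raw):
    """g_i and h_i of the logistic loss l = -y*log(p) - (1-y)*log(1-p),
    with p = sigmoid(yhat), differentiated w.r.t. the raw score yhat."""
    p = sigmoid(y_pred_raw)
    g = p - y_true           # first derivative  g_i
    h = p * (1.0 - p)        # second derivative h_i
    return g, h

def squared_grad_hess(y_true, y_pred_raw):
    """g_i and h_i for the squared loss l = (y - yhat)^2."""
    g = 2.0 * (y_pred_raw - y_true)
    h = 2.0 * np.ones_like(y_true)
    return g, h

# The same booster machinery only ever sees (g, h), whatever the loss:
y = np.array([1.0, 0.0, 1.0])
raw = np.array([0.0, 0.0, 2.0])
g_log, h_log = logistic_grad_hess(y, raw)
g_sq, h_sq = squared_grad_hess(y, raw)
```

Swapping one pair of `(g, h)` functions for another changes the loss being optimized without touching the tree-building code.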

ω(f)=γT+12λj=1Twj2\omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2

which is the regularization part of the objective defined above. Now the objective can be rewritten as

obj(t)i=1n(giw(xi)+12hiw2(xi))+γT+12λj=1Twj2\begin{align*} \text{obj}^{(t)} & \approx \sum_{i=1}^n \Big( g_i w(x_i) + \frac{1}{2}h_i w^2(x_i) \Big) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 \end{align*}

where w(xi)w(x_i) is the score of the leaf that xix_i falls into. Because all xix_i in the same leaf jj get the same score wjw_j, we can rearrange this sum as

obj(t)j=1T(Gjwj+12(Hj+λ)wj2)+γT\begin{align*} \text{obj}^{(t)} & \approx \sum_{j=1}^T \Big( G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \Big) + \gamma T \end{align*}

where Gj=xileafjgiG_j = \sum_{x_i \in \text{leaf}_j}g_i and Hj=xileafjhiH_j = \sum_{x_i \in \text{leaf}_j}h_i. Since this objective is quadratic with respect to wjw_j, we can find the optimal leaf score wjw_j^\star that minimizes the objective:

wj=GjHj+λw_j^\star = -\frac{G_j}{H_j+\lambda}

So the minimum value of the objective with respect to leaf scores is

obj=12j=1TGj2Hj+λ+γT(✠)\text{obj}^\star = - \frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda} + \gamma T \tag{\maltese}

So if we have the tree structure, then we have GjG_j and HjH_j, from which we get the optimal leaf scores for that tree structure. With this, we can now compare any two trees and say which one is better, i.e., gives a smaller objective value (or smaller residual error). Now what is the best tree? Ideally we would enumerate all possible trees and pick the best one. In practice this is intractable, so we go greedy and optimize one level of the tree at a time, at every split. According to equation \maltese, the change in the objective due to a split at a node is:

Gain=12(GL2HL+λ+GR2HR+λ(GL+GR)2HL+HR+λ)γ\text{Gain} = \frac{1}{2} \Big ( \frac{G^2_L}{H_L+\lambda} + \frac{G^2_R}{H_R+\lambda} - \frac{(G_L + G_R)^2}{H_L + H_R+\lambda} \Big) - \gamma

If this is positive, the new objective is smaller. The hyperparameter γ\gamma thus acts as the minimum gain required to make a split. In XGBoost, feature importance is calculated based on the total gain obtained by all splits using the feature. These gains are summed (or averaged) per feature over all trees and reported as feature importance. The best feature contributes most to reducing the error. On the other hand, increasing the hyperparameter λ\lambda decreases the leaf scores at every split, which in turn makes splitting more conservative because less gain is obtained from a split. So in XGBoost, splits are not made by impurity or variance as in standard decision trees. Instead, they are made by directly minimizing the second-order Taylor approximation of the loss.
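The leaf-score and gain formulas above can be computed directly; the GG, HH, λ\lambda, and γ\gamma values below are made-up toy numbers:

```python
def leaf_score(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_objective(G, H, lam):
    """Contribution -G^2 / (2 (H + lambda)) of one leaf to obj*."""
    return -0.5 * G**2 / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a node into left/right children (the Gain equation)."""
    return 0.5 * (G_L**2 / (H_L + lam)
                  + G_R**2 / (H_R + lam)
                  - (G_L + G_R)**2 / (H_L + H_R + lam)) - gamma

# Toy example: gradients split cleanly into a negative and a positive group,
# so separating them yields a positive gain.
gain = split_gain(G_L=-4.0, H_L=2.0, G_R=4.0, H_R=2.0, lam=1.0, gamma=0.5)
w_left = leaf_score(-4.0, 2.0, 1.0)
```

Re-running `split_gain` with a larger `lam` shows the regularization effect described above: the same split yields less gain, so it is less likely to be made.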

As is clear from the objective objt\text{obj}^t, the new tree ftf_t tries to decrease the residual of the previous model yy^(t1)y - \hat y^{(t-1)}. In other words, it reduces the current model's error. The learning rate η\eta controls how much ftf_t contributes to this reduction, so each new tree is added to the model with shrinkage to improve it. This is how gradient boosting learns sequentially: always training the next model to fix the mistakes of the combined model so far.

Hyperparameter Tuning in XGBoost

| Hyperparameter | Purpose |
| --- | --- |
| max_depth | Controls complexity of each tree (↓ depth = ↑ bias) |
| learning_rate (η) | Shrinks the update per tree (smaller = better generalization, needs more trees) |
| n_estimators | Total trees to train |
| subsample | Fraction of data used per tree (↓ = regularization) |
| colsample_bytree | Fraction of features used per tree (↓ = less overfitting) |
| lambda, alpha | L2 and L1 regularization on weights |
| min_child_weight | Minimum sum of instance weights (hessian) in a child; prevents small, noisy splits |

Reference: Introduction to Boosted Trees
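A minimal tuning sketch. It uses scikit-learn's GradientBoostingClassifier, which exposes analogous knobs (max_depth, learning_rate, n_estimators, subsample); with the xgboost package installed, XGBClassifier would accept a grid of the same form. The synthetic dataset and grid values here are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem, just to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "max_depth": [2, 3],          # shallower trees -> higher bias, lower variance
    "learning_rate": [0.1, 0.3],  # smaller steps usually need more trees
    "subsample": [0.5, 1.0],      # < 1.0 adds row-subsampling regularization
}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
```

In practice learning_rate and n_estimators are tuned jointly, since halving the learning rate roughly requires doubling the number of trees.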

A better alternative to XGBoost's built-in feature importance is to use SHAP values, a more modern and precise way to evaluate feature importance. SHAP calculates feature attributions per prediction and can give:

SHAP is slower, but more accurate and interpretable, especially for regulated or sensitive domains.

Stacking

Stacking is based on a simple idea: instead of using trivial functions such as hard voting to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation? To train the blender, a common approach is to use a hold-out set. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer, say 3 predictors. These predictors then make predictions on the second, held-out subset. This ensures that the predictions are “clean,” since the predictors never saw these instances during training. Now for each instance in the hold-out set, there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer’s predictions. It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is then to split the training set into three subsets.

🔧 Example:
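A sketch of this idea using scikit-learn's StackingClassifier, which produces the "clean" first-layer predictions internally via cross-validation; the choice of base models and the toy dataset are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# First layer: three heterogeneous predictors.
estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('svc', SVC(probability=True, random_state=0)),
    ('knn', KNeighborsClassifier()),
]
# The blender (final_estimator) is trained on out-of-fold predictions of the
# first layer, so it never sees predictions on instances used for training.
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X, y)
train_score = stack.score(X, y)
```

The `cv=5` argument plays the role of the hold-out set described above: each base model predicts only on folds it was not trained on.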

Voting Ensemble

Combine predictions of several models without training a meta-model.
Types:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression()),
    ('rf', RandomForestClassifier()),
    ('svc', SVC(probability=True))  # probability=True is required for soft voting
], voting='soft')

Support Vector Machine

SVM is a supervised learning algorithm used for binary classification (and extended to regression or multiclass problems). It starts by finding a linear classifier: given data points from two classes, it seeks the hyperplane that separates the two classes with the maximum margin. In SVM, the decision function is

f(x)=sign(wTx+b)f(x) = \text{sign}\big(\bm w^T\bm x+b\big)

If f(x)>0f(x) > 0, xx is classified as +1. If f(x)<0f(x) < 0, xx is classified as -1 in the binary classification of labels {1,1}\{ -1, 1\}.

Hard Margin SVM (Linearly Separable)

Mathematically, the objective becomes:

{maxw,bCs.t.    t(i)(wTx(i)+b)w2Ci=1,,N\begin{cases} \max_{\bm w, b} C \\ \text{s.t.}\;\; \frac{t^{(i)}(\bm w^T\bm x^{(i)}+b)}{||\bm w||_2} \ge C &\forall i= 1,\dots, N \end{cases}

where ti{1,1}t_i \in \{ -1, 1\}. Because the left-hand side does not depend on the length of w\bm w (only on its direction), whatever the optimal value of CC is, we can fix the scale of w\bm w so that C=1w2C = \frac{1}{||\bm w||_2}. Therefore the above optimization objective is equivalent to

{minw,bw22s.t.    t(i)(wTx(i)+b)1i=1,,N\begin{cases} \min_{\bm w, b} ||\bm w||^2_2 \\ \text{s.t.}\;\; t^{(i)}(\bm w^T\bm x^{(i)}+b) \ge 1 &\forall i= 1,\dots, N \end{cases}

Note that points x(i)\bm x^{(i)} far from the boundary do not affect the solution of this problem, so we could remove them from the training set and the optimal w\bm w would be the same. The important training examples are the ones with algebraic margin 1; they are called support vectors. Hence, this algorithm is called the hard-margin Support Vector Machine (SVM) (or Support Vector Classifier). SVM-like algorithms are often called max-margin or large-margin methods. The support vectors in hard-margin SVM lie exactly on the margin boundaries:

t(i)(wTx(i)+b)=1t^{(i)}(\bm w^T\bm x^{(i)}+b) = 1

Removing a support vector would change the decision boundary. How can we apply the max-margin principle if the data are not linearly separable?

Soft Margin SVM (Real-World, Noisy Data)

The strategy for solving this is to:


So the soft margin constraint could be expressed as:

{minw,b,ξw22+γiξis.t.    t(i)(wTx(i)+b)1ξii=1,,Nξi0i=1,,N(†)\begin{equation*} \begin{cases} \min_{\bm w, b, \bm \xi} ||\bm w||^2_2 + \gamma \sum_i \xi_i \\ \text{s.t.}\;\; t^{(i)}(\bm w^T\bm x^{(i)}+b) \ge 1-\xi_i &\forall i= 1,\dots, N \\ \xi_i \ge 0 &\forall i= 1,\dots, N \end{cases} \end{equation*}\tag{\dag}

We can simplify the soft margin constraint by eliminating ξiξ_i. The constraint can be rewritten as ξi1t(i)(wTx(i)+b)\xi_i \ge 1-t^{(i)}(\bm w^T\bm x^{(i)}+b). So:

In fact ξi=max(0,1t(i)(wTx(i)+b))\xi_i = \max\big (0, 1- t^{(i)}(\bm w^T\bm x^{(i)}+b)\big) is the optimal solution for ξi\xi_i. Therefore our objective now simplifies to the following:

minw,bi=1Nmax(0,1t(i)(wTx(i)+b))+12γw22\min_{\bm w, b} \sum_{i=1}^N \max\big (0, 1- t^{(i)}(\bm w^T\bm x^{(i)}+b)\big) + \frac{1}{2\gamma} ||\bm w||^2_2

The loss function L(y,t)=max(0,1ty)L(y,t) = \max(0, 1-ty) is called the hinge loss. The second term is the L2-norm of the weights. Hence, the soft-margin SVM can be seen as a linear classifier with hinge loss and an L2 regularizer.
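The resulting objective is easy to evaluate directly; a numerical sketch with made-up toy points:

```python
import numpy as np

def soft_margin_objective(w, b, X, t, gamma):
    """Soft-margin SVM objective: sum of hinge losses + (1/(2*gamma)) * ||w||^2."""
    margins = t * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()   # hinge loss L(y,t) = max(0, 1-ty)
    return hinge + (1.0 / (2.0 * gamma)) * np.dot(w, w)

# Two separable points; with this w and b both have margin exactly 1,
# so the hinge term vanishes and only the regularizer remains.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
obj = soft_margin_objective(w, b=0.0, X=X, t=t, gamma=1.0)
```

Points on or beyond the margin contribute zero hinge loss; points inside the margin or misclassified contribute linearly, which is what makes the loss robust to outliers compared to squared error.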

Dual Form & Kernel Trick

The Lagrange (primal) function corresponding to the optimization objective (\dag) is:

Lp=w22+γiξiiαi(t(i)(wTx(i)+b)(1ξi))iμiξiL_p = ||\bm w||^2_2 + \gamma \sum_i \xi_i - \sum_i \alpha_i \big( t^{(i)}(\bm w^T\bm x^{(i)}+b) - (1-\xi_i) \big) - \sum_i\mu_i\xi_i

which we minimize with respect to w,b\bm w, b and ξi\xi_i. Setting the respective derivatives to zero, we get

w=iαit(i)x(i)0=iαit(i)αi=γμi\begin{align} \bm w & = \sum_i \alpha_it^{(i)} \bm x^{(i)} \\ 0 & = \sum_i \alpha_i t^{(i)} \\ \alpha_i & = \gamma - \mu_i \end{align}

as well as the positivity constraints αi,µi,ξi0α_i, µ_i, ξ_i ≥0 for all ii. By substituting the above equations into LPL_P, we obtain the Lagrangian (Wolfe) dual objective function

maxαi=1nαi12i=1Nj=1Nαiαjt(i)t(j)x(i)Tx(j)\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i\alpha_j t^{(i)} t^{(j)} \bm {x^{(i)}}^T \bm {x^{(j)}}

In addition to (1)–(3), the Karush–Kuhn–Tucker conditions include the constraints

αi(t(i)(wTx(i)+b)(1ξi))=0μiξi=0t(i)(wTx(i)+b)(1ξi)0\begin{align} \alpha_i \big( t^{(i)}(\bm w^T\bm x^{(i)}+b) - (1-\xi_i)\big) & = 0\\ \mu_i\xi_i & = 0\\ t^{(i)}(\bm w^T\bm x^{(i)}+b) - (1-\xi_i) & \ge 0 \end{align}

From equation (1), we see that the solution for w\bm w has the form

w^=iα^it(i)x(i)\begin{align} \bm {\hat w} = \sum_i \hat \alpha_it^{(i)} \bm x^{(i)} \end{align}

with α^i>0\hat α_i > 0 only for those observations ii for which

t(i)(w^Tx(i)+b^)=1ξ^it^{(i)}(\bm {\hat w}^T\bm x^{(i)} + \hat b) = 1- \hat \xi_i

otherwise condition (4) implies α^i=0\hat α_i = 0. These observations are called the support vectors, since the solution w^\bm {\hat w} only depends on them. Among these support points, some will lie on the edge of the margin ξ^i=0\hat ξ_i = 0, some inside the margin 0<ξ^i<10 < \hat ξ_i < 1, and others are misclassified ξ^i>1\hat ξ_i > 1 and outside of the margin on the wrong side of the decision boundary.

Support Vector Machines and Kernels

The support vector classifier described so far finds linear boundaries in the input feature space. As with other linear methods, we can make the procedure more flexible by enlarging the feature space using basis expansions such as polynomials or splines. Generally linear boundaries in the enlarged space achieve better training-class separation, and translate to nonlinear boundaries in the original space. Once the basis functions hm(x),m=1,...,Mh_m(\bm x), m= 1,...,M are selected, the procedure is the same as before. We fit the SV classifier using input features h(xi)=(h1(xi),h2(xi),...,hM(xi))h(\bm x_i) = (h_1(\bm x_i),h_2(\bm x_i),...,h_M (\bm x_i)), i=1,...,Ni = 1,...,N, and produce the (nonlinear) function f(x)=wTh(x)+bf(\bm x) = \bm w^Th(\bm x) + b. The classifier is signf(x)\text{sign} f(\bm x) as before. For example, if x=(x1,x2)\bm x = (x_1,x_2) then hh could be defined as

h(x)=(x12,2x1x2,x22,2x1,2x2,1)h(\bm x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2, \sqrt{2}x_1, \sqrt{2}x_2,1)

which is a mapping into 6-dim space. Then we can calculate

K(x(i),x(j))=<h(x(i)),h(x(j))>=(<x(i),x(j)>+1)2\begin{align*} K(\bm x^{(i)}, \bm x^{(j)}) & = \left < h(\bm x^{(i)}) , h(\bm x^{(j)}) \right> = \Big(\left < \bm x^{(i)} , \bm x^{(j)} \right> +1 \Big)^2 \end{align*}
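We can check numerically that this 6-dimensional map reproduces the degree-2 polynomial kernel; a quick verification sketch:

```python
import numpy as np

def h(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2 dimensions."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def poly_kernel(x, z):
    """K(x, z) = (<x, z> + 1)^2, computed without the explicit map."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])
explicit = np.dot(h(x), h(z))     # inner product in the 6-dim feature space
implicit = poly_kernel(x, z)      # same value via the kernel trick
```

The kernel side costs one 2-dimensional inner product instead of building two 6-dimensional vectors, and the gap widens quickly with input dimension and polynomial degree.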

The SVM optimization problem involves the transformed feature vectors h(xi)h(\bm x_i) only through inner products, so with a kernel these inner products can be computed very cheaply. The Lagrange dual function has the form

i=1nαi12i=1Nj=1Nαiαjt(i)t(j)h(x(i))Th(x(j))\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j t^{(i)} t^{(j)} h(\bm x^{(i)})^Th(\bm x^{(j)})

Using equation (7), the solution f(x)f(x) of the SVM classifier on this 6-dim feature mapping can be written as:

f(x)=wTh(x)+b=iα^it(i)h(x(i))Th(x)+b^=iα^it(i)K(x(i),x)+b^\begin{align*} f(x) = \bm w^T h(\bm x) + b & = \sum_i \hat \alpha_i t^{(i)} h(\bm x^{(i)})^Th(\bm x) + \hat b\\ & = \sum_i \hat \alpha_i t^{(i)} K(\bm x^{(i)}, \bm x) + \hat b \end{align*}

In fact, we need not specify the transformation h(x)h(\bm x) at all, but require only knowledge of the kernel function K(x(i),x(j))=<h(x(i)),h(x(j))>K(\bm x^{(i)}, \bm x^{(j)}) = \left< h(\bm x^{(i)}), h(\bm x^{(j)})\right> that computes inner products in the transformed space. KK should be a symmetric positive (semi-) definite function. Three popular choices for KK in the SVM literature are:

Neural Networks

The models we have seen so far make assumptions about the data. Although powerful, they still have limited capacity for many important tasks due to the complexity of the data or of the task itself. Expanding their capabilities may require us to engineer new features to help them learn more complex and general patterns. Inspired by how the human brain works and learns, neural networks were created to overcome these difficulties more efficiently. In a basic neural network model, we first construct MM linear combinations of the input variables x1,...,xDx_1, . . . , x_D in the form

aj(xi)=i=1Dwji(1)xi+wj0(1)a_j(x_i) = \sum_{i=1}^D w^{(1)}_{ji} x_i + w^{(1)}_{j0}

where j=1,...,Mj = 1, . . . , M, and the superscript (1) indicates that the corresponding parameters are in the first layer of the network. We shall refer to the parameters wji(1)w^{(1)}_{ji} as weights and the parameters wj0(1)w^{(1)}_{j0} as biases. The quantities aja_j are known as activations. Each of them is then transformed using a differentiable, nonlinear activation function h()h(·) to give

zj(xi)=h(aj(xi)).z_j(x_i) = h(a_j(x_i)).

These quantities correspond to the outputs of the basis functions that, in the context of neural networks, are called hidden units. These units in the middle of the network, computing the derived features, are called hidden units because the values zjz_j are not directly observed. The nonlinear functions h()h(·) are generally chosen to be sigmoidal functions such as the logistic sigmoid or the tanh function. These values are again linearly combined to give output unit activations:

ak(xi)=i=1Mwki(2)zi(xi)+wk0(2)a_k(x_i) = \sum_{i=1}^M w^{(2)}_{ki} z_i(x_i) + w^{(2)}_{k0}

where k=1,...,Kk = 1, . . . , K, and KK is the total number of outputs. This transformation corresponds to the second layer of the network. A multilayer network consisting of fully connected layers is called a multilayer perceptron (MLP). Finally, the output unit activations are transformed using an appropriate activation function to give a set of network outputs yk\bm y_k.
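The two-layer computation above fits in a few lines of NumPy; the weights are random and tanh is used as the hidden activation, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 4, 3, 2                  # input dim, hidden units, outputs

W1 = rng.normal(size=(M, D))       # first-layer weights w^(1)_{ji}
b1 = rng.normal(size=M)            # first-layer biases  w^(1)_{j0}
W2 = rng.normal(size=(K, M))       # second-layer weights w^(2)_{kj}
b2 = rng.normal(size=K)

def forward(x):
    a1 = W1 @ x + b1               # activations a_j
    z = np.tanh(a1)                # hidden units z_j = h(a_j)
    a2 = W2 @ z + b2               # output unit activations a_k
    return a2                      # identity output activation (regression)

x = rng.normal(size=D)
y = forward(x)
```

For classification, the final line would pass a2 through a sigmoid or softmax instead of returning it directly.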

Another generalization of the network architecture is to include skip-layer connections, each of which is associated with a corresponding adaptive parameter. Furthermore, the network can be sparse, with not all possible connections within a layer being present. We can develop more general network mappings by considering more complex network diagrams. However, these must be restricted to a feed-forward architecture, in other words to one having no closed directed cycles, to ensure that the outputs are deterministic functions of the inputs.


Neural networks are said to be universal approximators. For example, a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided the network has a sufficiently large number of hidden units. This result holds for a wide range of hidden unit activation functions, excluding polynomials. Neural nets can be viewed as a way of learning features: the hidden units in the middle layers are the learned features that lead to the final output of the net. Suppose we’re trying to classify images of handwritten digits. Each image is represented as a vector of 28 ×28 = 784 pixel values. Each first-layer hidden unit computes σ(wiTx)σ(\bm w^T_i \bm x), acting as a feature detector. We can visualize wi\bm w_i by reshaping it into an image, revealing that each unit has learned a small feature of the input, such as an edge detector.

We can connect lots of units together into a directed acyclic graph. This gives a feed-forward neural network. That’s in contrast to recurrent neural networks, which can have cycles. Typically, units are grouped together into layers. Each layer connects N input units to M output units. In the simplest case, all input units are connected to all output units. We call this a fully connected layer. The layer structure of neural networks provides modularity: we can implement each layer’s computations as a black box and then combine or stack them as needed. Some common activation functions are:

The rate of activation of the sigmoid depends on the norm of ak\bm a_k, and if ak||\bm a_k|| is very small, the unit will indeed be operating in the linear part of its activation function.


The choice of activation function is determined by the nature of the data and the assumed distribution of target variables. Sigmoid and tanh squash inputs into small ranges, so their derivatives become tiny for large inputs (i.e., vanishing gradients). With ReLU, the gradient is strong and does not vanish for x>0x>0. It is very fast to compute, and empirically ReLU often leads to faster training and better local minima in deep networks compared to sigmoid or tanh. On the other hand, if a neuron’s input is always negative, its output is always 0 and its gradient is 0, so it never learns. To fix this, Leaky ReLU is used: a small slope for x<0x<0, e.g., LeakyReLU(x)=max(0.01x,x)\text{LeakyReLU}(x)=\max(0.01x,x). Since the output of ReLU is not bounded, gradients can explode in deep nets if not normalized properly, which is why BatchNormalization becomes useful here.

If the activation functions of all the hidden units in a network are taken to be linear (or removed), then the entire model collapses to a linear model in the inputs. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. In fact, networks of only linear units give rise to principal component analysis. Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification.

First, MLPs can be used for regression tasks. In general, for standard regression problems, the activation function is the identity so that yk=ak\bm y_k =\bm a_k, regressing KK targets in multivariate regression. If you want to predict a single value (e.g., the price of a house given many of its features), then you just need a single output neuron: its output is the predicted value. In general, when building an MLP for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values. However, if you want to guarantee that the output will always be positive, then you can use the ReLU activation function, or the softplus activation function in the output layer. Finally, if you want to guarantee that the predictions will fall within a given range of values, then you can use the logistic function or the hyperbolic tangent, and scale the labels to the appropriate range: 0 to 1 for the logistic function, or –1 to 1 for the hyperbolic tangent.

The loss function to use during training is typically the mean squared error, but if you have a lot of outliers in the training set, you may prefer to use the mean absolute error instead. Alternatively, you can use the Huber loss, which is a combination of both. The Huber loss is quadratic when the error is smaller than a threshold δδ (typically 1), but linear when the error is larger than δδ. This makes it less sensitive to outliers than the mean squared error, and it is often more precise and converges faster than the mean absolute error.

MLPs can also be used for classification tasks. For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. Obviously, the estimated probability of the negative class is equal to one minus that number. MLPs can also easily handle multilabel binary classification tasks.

Similarly, for multiple binary classification problems, each output unit activation is transformed using a logistic sigmoid function so that yk=σ(ak)\bm y_k = σ(\bm a_k). For KK-class multiclass classification, there are KK units at the top, with the kkth unit modeling the probability of class kk using softmax activation function. There are KK target measurements tkt_k, k=1,...,Kk= 1,...,K, each being coded as a 0−1 variable for the kkth class and the corresponding classifier is C(x)=arg maxkyk(xi)C(x) = \argmax_k y_k(x_i). With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.

If the training set was very skewed, with some classes being overrepresented and others underrepresented, it would be useful to set the class_weight argument when calling the fit() method, giving a larger weight to underrepresented classes, and a lower weight to overrepresented classes. These weights would be used by Keras when computing the loss. If you need per-instance weights instead, you can set the sample_weight argument (it supersedes class_weight). This could be useful for example if some instances were labeled by experts while others were labeled using a crowdsourcing platform: you might want to give more weight to the former.

Fitting Neural Networks

The neural network model has unknown weights, and we seek values for them that make the model fit the training data well, i.e., minimize the error function. For regression, we use the sum-of-squared errors as our measure of fit (error function):

E(w)=k=1Ki=1N(yk(xi)tk(xi))2E(\bm w) = \sum_{k=1}^K \sum_{i=1}^N (y_k(x_i) - t_k(x_i))^2

For classification we use either squared error or cross-entropy (deviance):

E(w)=k=1Ki=1Ntk(xi)logyk(xi)E(\bm w) = - \sum_{k=1}^K \sum_{i=1}^N t_k(x_i)\log y_{k}(x_i)

Because there is clearly no hope of finding an analytical solution to the equation wE(w)=0∇_{\bm w} E(\bm w) = 0, we resort to iterative numerical procedures. Most techniques involve choosing some initial value w(0)\bm w^{(0)} for the weight vector and then moving through weight space in a succession of steps of the form w(τ+1)=w(τ)+w(τ)\bm w^{(τ+1)} = \bm w^{(τ)} + ∆\bm w^{(τ)} where ττ labels the iteration step. Different algorithms involve different choices for the weight vector update w(τ)∆\bm w^{(τ)}. Many algorithms make use of gradient information and therefore require that, after each update, the value of E(w)∇E(\bm w) is evaluated at the new weight vector w(τ+1)\bm w^{(τ+1)}. The generic approach to minimizing E(w)E(\bm w) is gradient descent, called backpropagation in this setting. Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation, layer by layer, from the output of the network back toward the input. This is done with a forward and a backward sweep over the network, keeping track only of quantities local to each unit.

Backpropagation

Backpropagation can be implemented as a two-pass algorithm. In the forward pass, the current weights are fixed and the predicted values yk(xi)y_k (x_i) are computed. In the backward pass, the errors are computed and backpropagated through the network to calculate the gradient. The gradient updates are a kind of batch learning, with the parameter updates being a sum over all of the training cases. Learning can also be carried out online, processing each observation one at a time, updating the gradient after each training case, and cycling through the training cases many times. A training epoch refers to one sweep through the entire training set. Online training allows the network to handle very large training sets, and also to update the weights as new observations come in.

As an example of how backpropagation helps compute the gradient, consider the cost gradient dJ/dwdJ/d\bm w, which is the vector of partial derivatives. This is the average of dL/dwdL/d\bm w over all the training examples, so here we focus on computing dL/dwdL/d\bm w. Take a one-layer perceptron with a regularizer as an example:

z=wx+by=σ(z)L=12(yt)2R=12w2LR=L+λR\begin{align*} z & = w x + b\\ y & = \sigma(z)\\ L & = \frac{1}{2}(y-t)^2\\ R &= \frac{1}{2} w^2\\ L_R & = L + \lambda R \end{align*}

We can diagram out the computations using a computation graph. The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.

```mermaid
flowchart LR
    Start[ ] ----->|Compute Loss| End[ ]
    id1(($$x$$)) --> id4(($$z$$))
    id2(($$b$$)) --> id4(($$z$$))
    id3(($$w$$)) --> id4(($$z$$))
    id7((t)) --> id6(($$L$$))
    id4(($$z$$)) --> id5(($$y$$))
    id5(($$y$$)) --> id6(($$L$$))
    id6(($$L$$)) --> id9(($$L_R$$))
    id3(($$w$$)) --> id8(($$R$$))
    id8((R)) --> id9(($$L_R$$))
    style Start fill-opacity:0, stroke-opacity:0;
    style End fill-opacity:0, stroke-opacity:0;
```

The forward pass is straightforward. For now assume that we are dealing with single variables and single-value functions. The backward pass goes according to the chain rule:

Notation: y˙=LRy\dot y = \frac{\partial L_R}{\partial y}

L˙R=1R˙=L˙RLRR=L˙RλL˙=L˙RLRL=L˙Ry˙=L˙Ly=L˙(yt)z˙=y˙yz=y˙σ(z)b˙=z˙zb=z˙w˙=z˙zw+R˙Rw=z˙x+R˙w\begin{align*} \dot L_R & = 1\\ \dot R & = \dot L_R \frac{\partial L_R}{\partial R} = \dot L_R \lambda \\ \dot L & = \dot L_R \frac{\partial L_R}{\partial L} =\dot L_R \\ \dot y & = \dot L \frac{\partial L}{\partial y} = \dot L (y-t)\\ \dot z & = \dot y \frac{\partial y}{\partial z} = \dot y \sigma'(z) \\ \dot b & = \dot z \frac{\partial z}{\partial b} = \dot z \\ \dot w & = \dot z \frac{\partial z}{\partial w} + \dot R \frac{\partial R}{\partial w} = \dot z x + \dot R w \end{align*}
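These backward-pass equations translate line by line into code, and a finite-difference check confirms the derivative; the sigmoid and the sample values below are chosen for illustration:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x, t, lam):
    z = w * x + b
    y = sigma(z)
    L = 0.5 * (y - t) ** 2
    R = 0.5 * w ** 2
    return L + lam * R              # L_R = L + lambda * R

def backward(w, b, x, t, lam):
    """Backward pass, following the chain-rule equations above."""
    z = w * x + b
    y = sigma(z)
    LR_bar = 1.0                    # dL_R/dL_R
    R_bar = LR_bar * lam            # dL_R/dR
    L_bar = LR_bar                  # dL_R/dL
    y_bar = L_bar * (y - t)         # dL_R/dy
    z_bar = y_bar * y * (1.0 - y)   # sigma'(z) = sigma(z)(1 - sigma(z))
    b_bar = z_bar                   # dL_R/db
    w_bar = z_bar * x + R_bar * w   # dL_R/dw (two paths: via z and via R)
    return w_bar, b_bar

w, b, x, t, lam = 0.7, -0.3, 2.0, 1.0, 0.1
w_bar, b_bar = backward(w, b, x, t, lam)

# Finite-difference check of dL_R/dw with a small h
h = 1e-6
fd = (forward(w + h, b, x, t, lam) - forward(w - h, b, x, t, lam)) / (2 * h)
```

Note that w_bar accumulates contributions from both paths that reach w in the computation graph, via z and via R, exactly as in the last backward-pass equation.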

To perform these computations efficiently, we need to vectorize these computations in matrix form. For a fully connected layer connection zyz \rightarrow y, the backprop rules will be:

z˙j=iy˙iyizj\dot z_j = \sum_i \dot y_i \frac{\partial y_i}{\partial z_j}

```mermaid
flowchart LR
    id1(($$z_1$$)) --> id2(($$y_1$$))
    id3(($$z_2$$)) --> id4(($$y_2$$))
    id5(($$z_3$$)) --> id6(($$y_3$$))
    id1(($$z_1$$)) --> id4(($$y_2$$))
    id1(($$z_1$$)) --> id6(($$y_3$$))
    id3(($$z_2$$)) --> id2(($$y_1$$))
    id3(($$z_2$$)) --> id6(($$y_3$$))
    id5(($$z_3$$)) --> id4(($$y_2$$))
    id5(($$z_3$$)) --> id2(($$y_1$$))
```

which looks like the following in matrix form:

z˙=yzTy˙\dot {\bm z} = \frac{\partial \bm y}{\partial \bm z}^T \dot {\bm y}

where

yz=(y1z1y1znymz1ymzn)\frac{\partial \bm y}{\partial \bm z} = \begin{pmatrix} \frac{\partial y_1}{\partial z_1} &\dots &\frac{\partial y_1}{\partial z_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial z_1} &\dots &\frac{\partial y_m}{\partial z_n} \end{pmatrix}

is the Jacobian matrix. So if z=Wx\bm z = \bm W \bm x, then zx=W\frac{\partial \bm z}{\partial \bm x} = \bm W and x˙=WTz˙\dot {\bm x} = \bm W^T \dot {\bm z}. Backprop is used to train the overwhelming majority of neural nets today. Check your derivatives numerically by plugging in a small value of h; this is known as finite differences.

xif(x1,,xN)=limh0f(x1,,xi+h,,xN)f(x1,,xih,,xN)2h\frac{\partial }{\partial x_i} f(x_1, \dots, x_N) = \lim_{h \rightarrow 0} \frac{f(x_1,\dots, x_i+h,\dots, x_N) - f(x_1,\dots, x_i-h,\dots, x_N)}{2h}

Run gradient checks on small, randomly chosen inputs. Use double-precision floats (not the default for TensorFlow, PyTorch, etc.!). Compute the relative error ab/(a+b)|a−b| / (|a|+ |b|); it should be very small, e.g. 10610^{−6}. Gradient checking is really important: learning algorithms often appear to work even if the math is wrong.
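A minimal sketch of such a gradient check in double precision, using a made-up test function f(x) = Σxᵢ³ whose analytic gradient 3xᵢ² is known:

```python
import numpy as np

def numeric_grad(f, x, i, h=1e-6):
    """Central finite difference of f at x along coordinate i."""
    xp, xm = x.copy(), x.copy()
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2 * h)

def relative_error(a, b):
    return abs(a - b) / (abs(a) + abs(b) + 1e-12)

# Toy example: f(x) = sum(x^3) has analytic gradient 3 x^2.
f = lambda x: np.sum(x ** 3)
x = np.array([0.5, -1.2, 2.0, 0.8, -0.3])   # float64 (double precision)
analytic = 3 * x ** 2
for i in range(len(x)):
    err = relative_error(numeric_grad(f, x, i), analytic[i])
    assert err < 1e-6   # relative error should be very small
```

Keeping the test inputs away from zero avoids dividing a tiny analytic gradient by a tiny numeric one, which would inflate the relative error for spurious reasons.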

Neural Nets: Non-convex Optimization

Training a network with hidden units cannot yield a convex optimization problem because of permutation symmetries. Suppose you have a simple feed-forward network: x → Hidden Layer → Output. Say the hidden layer has 3 neurons. Each neuron applies:

hj=σ(wjTx+bj)h_j = \sigma(w_j^T x + b_j)

Then you compute:

y^=j=13vjhj+c\hat y = \sum_{j=1}^3 v_j h_j + c

Now suppose you pick two neurons, say 1 and 2, in the hidden layer, swap their incoming weights w1w2w_1 \leftrightarrow w_2 and biases b1b2b_1 \leftrightarrow b_2, and then swap their outgoing weights v1v2v_1 \leftrightarrow v_2. As long as all connections are permuted consistently, the overall function computed by the network remains the same and the network's output does not change. This implies that many different parameter configurations represent the same function, and they all have the same loss. So the loss surface has many symmetric minima that are functionally identical. That is not the case for a convex loss function, which has a single global optimum (more precisely, a convex set of optima). Moreover, suppose we average the parameters of two such permuted configurations and substitute this average for the parameters we averaged. We get a degenerate model in which all the hidden units are identical. If the loss were convex, this averaged configuration would have to achieve a loss no larger than that of the two minima, which is absurd. Hence, training a multilayer neural net is non-convex. Permutation symmetries imply that the loss surface has many equivalent minima; these are not isolated points, but are connected by symmetry transformations.
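This symmetry is easy to verify numerically. The sketch below builds a toy 3-unit network with made-up random parameters, swaps two hidden units' incoming and outgoing weights consistently, and checks that the output is unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp(x, W, b, v, c):
    # W: (3, d) incoming weights, b: (3,) biases,
    # v: (3,) outgoing weights, c: scalar output bias
    h = sigmoid(W @ x + b)
    return v @ h + c

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
W = rng.normal(size=(3, d))
b = rng.normal(size=3)
v = rng.normal(size=3)
c = 0.5

# Swap hidden units 0 and 1 consistently (incoming weights, biases,
# and outgoing weights all permuted the same way).
perm = [1, 0, 2]
y1 = mlp(x, W, b, v, c)
y2 = mlp(x, W[perm], b[perm], v[perm], c)
assert np.isclose(y1, y2)   # different parameters, same function
```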

Training Neural Networks

There is quite an art in training neural networks. The model is generally overparametrized, and the optimization problem is nonconvex and unstable unless certain guidelines are followed. Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks. Looking at the logistic activation function, you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.

We need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don’t want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, Glorot and Bengio argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

Random Initialization

The connection weights of each layer must be initialized randomly, with a scale that depends on the layer's fan-in and fan-out; this is called Xavier (or Glorot) initialization. By default, Keras uses Glorot initialization with a uniform distribution. Note that exact-zero weights lead to zero derivatives and perfect symmetry, so the algorithm never moves. Starting instead with large weights often leads to poor solutions.
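A minimal sketch of Glorot uniform initialization, assuming the standard limit √(6/(fan_in + fan_out)) for the uniform distribution:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Xavier/Glorot uniform init: U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = glorot_uniform(300, 100)
# Variance of U(-a, a) is a^2 / 3 = 2 / (fan_in + fan_out), which is what
# keeps the signal variance roughly stable across layers.
assert abs(W.var() - 2.0 / 400) < 1e-3
```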

Vanishing/Exploding Gradients

One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function. It turns out that other activation functions behave much better in deep neural networks than the sigmoid, in particular the ReLU activation function, mostly because it does not saturate for positive values (and also because it is quite fast to compute). Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, the neuron keeps outputting 0s, and gradient descent no longer affects it, since the gradient of the ReLU function is 0 when its input is negative. To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as LeakyReLU(z)=max(αz,z)\text{LeakyReLU}(z) = \max(αz, z)
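A one-line NumPy version of this activation (with α given an assumed small default of 0.01):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): identity for z > 0, small slope alpha for z <= 0,
    # so gradients never vanish entirely and units cannot "die".
    return np.maximum(alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
assert np.allclose(leaky_relu(z), [-0.02, -0.005, 0.0, 1.5])
```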

Batch Normalization

Batch Normalization (BN) is a technique to address the vanishing/exploding gradients problems. The technique consists of adding an operation in the model just before or after the activation function of each hidden layer, which zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting. In other words, this operation lets the model learn the optimal scale and mean of each of the layer’s inputs. In many cases, if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set (e.g., using a StandardScaler): the BN layer will do it for you (well, approximately, since it only looks at one batch at a time, and it can also rescale and shift each input feature). In order to zero-center and normalize the inputs, the algorithm needs to estimate each input’s mean and standard deviation. It does so by evaluating the mean and standard deviation of each input over the current mini-batch.

μB=1mBi=1mBxiσB2=1mBi=1mB(xiμB)2xi^=xiμBσB2+ϵzi=γxi^+β\begin{align*} \bm \mu_B & = \frac{1}{m_B} \sum_{i=1}^{m_B}\bm x_i\\ \bm \sigma_B^2 & = \frac{1}{m_B} \sum_{i=1}^{m_B} (\bm x_i - \bm \mu_B)^2 \\ \hat {\bm x_i} & = \frac{\bm x_i - \bm \mu_B}{\sqrt{\bm \sigma_B^2 + \epsilon}} \\ \bm z_i &= \bm \gamma \otimes \hat{\bm x_i} + \bm \beta \end{align*}

where mBm_B is the mini-batch size and \otimes represents element-wise multiplication. It was reported that BN led to a huge improvement in the ImageNet classification task (ImageNet is a large database of images classified into many classes, commonly used to evaluate computer vision systems). The vanishing gradients problem was strongly reduced, to the point that saturating activation functions such as tanh and even the logistic activation function could be used. The networks were also much less sensitive to the weight initialization, and much larger learning rates could be used, significantly speeding up the learning process.
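The four equations above can be sketched directly in NumPy (training-time statistics only; a real BN layer would also track running averages of the mean and variance for use at inference):

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """BN forward pass over a mini-batch X of shape (m_B, features)."""
    mu = X.mean(axis=0)                      # per-feature mini-batch mean
    var = X.var(axis=0)                      # per-feature mini-batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)    # zero-center and normalize
    return gamma * X_hat + beta              # learned scale and shift

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 10))
Z = batch_norm_forward(X, gamma=np.ones(10), beta=np.zeros(10))
# With gamma = 1 and beta = 0, each output feature is ~standardized.
assert np.allclose(Z.mean(axis=0), 0.0, atol=1e-7)
assert np.allclose(Z.std(axis=0), 1.0, atol=1e-2)
```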

You may find that training is rather slow, because each epoch takes much more time when you use batch normalization. However, this is usually counterbalanced by the fact that convergence is much faster with BN, so it will take fewer epochs to reach the same performance. All in all, wall time will usually be smaller (this is the time measured by the clock on your wall). Batch Normalization has become one of the most used layers in deep neural networks, to the point that it is often omitted in the diagrams, as it is assumed that BN is added after every layer.

Clipping Gradients

Another popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This is called Gradient Clipping. This technique is most often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs. A typical setting clips every component of the gradient vector to a value between –1.0 and 1.0.
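Two common flavors can be sketched in NumPy: clipping each component by value, and rescaling by the global norm (which has the advantage of preserving the gradient's direction):

```python
import numpy as np

def clip_by_value(grads, clip=1.0):
    """Clip every component of every gradient to [-clip, clip]."""
    return [np.clip(g, -clip, clip) for g in grads]

def clip_by_norm(grads, max_norm=1.0):
    """Rescale the whole gradient if its global L2 norm exceeds max_norm;
    unlike value clipping, this keeps the gradient's direction unchanged."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

g = [np.array([3.0, -4.0])]
assert np.allclose(clip_by_value(g)[0], [1.0, -1.0])
assert np.isclose(np.linalg.norm(clip_by_norm(g)[0]), 1.0)
```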

Number of Hidden Units and Layers

It is most common to put down a reasonably large number of units and train them with regularization. The choice of the number of hidden layers is guided by background knowledge and experimentation. Each layer extracts features of the input for regression or classification. Use of multiple hidden layers allows construction of hierarchical features at different levels of resolution.

Regularization in Neural Networks

Note that MM, the number of hidden units, controls the number of parameters (weights and biases) in the network, and so we might expect that in a maximum likelihood setting there will be an optimum value of MM that gives the best generalization performance, corresponding to the optimum balance between under-fitting and over-fitting. The generalization error, however, is not a simple function of MM due to the presence of local minima in the error function. Here we see the effect of choosing multiple random initializations for the weight vector for a range of values of MM. The overall best validation set performance in this case occurred for a particular solution having M=8M = 8. In practice, one approach to choosing MM is in fact to plot a graph of the kind shown below and then to choose the specific solution having the smallest validation set error.

drawing

Examples of two-layer networks trained on 1010 data points drawn from the sinusoidal dataset. The graphs show the result of fitting networks having M=1,3M = 1, 3 and 1010 hidden units, respectively, by minimizing a sum-of-squares error function using a scaled conjugate-gradient algorithm. There are, however, other ways to control the complexity of a neural network model in order to avoid over-fitting. The simplest regularizer is the quadratic, giving a regularized error of the form

E~(w)=E(w)+λ2wTw.\tilde E(\bm w) = E(\bm w) + \frac{\lambda}{2} \bm w^T \bm w.

This regularizer is also known as weight decay and has been discussed at length. Larger values of λλ will tend to shrink the weights toward zero; typically cross-validation is used to estimate λ. The effective model complexity is then determined by the choice of the regularization coefficient λλ. As we have seen previously, this regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over the weight vector w\bm w. You can use ℓ1 and ℓ2 regularization to constrain a neural network’s connection weights (but typically not its biases). In Keras, for example, ℓ2 regularization is applied to a layer’s connection weights by passing a kernel_regularizer such as keras.regularizers.l2(0.01); the regularizer is called to compute the regularization loss at each step during training, and this regularization loss is then added to the final loss.
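A NumPy sketch of the quadratic penalty added to a sum-of-squares error on a toy linear model; the extra λw weight-decay term shows up directly in the gradient:

```python
import numpy as np

def regularized_loss_grad(w, X, y, lam=0.01):
    """Sum-of-squares error on a linear model plus (lam/2) * w^T w."""
    resid = X @ w - y
    E = 0.5 * np.sum(resid ** 2)
    E_tilde = E + 0.5 * lam * (w @ w)   # quadratic (weight decay) penalty
    grad = X.T @ resid + lam * w        # extra lam * w term from the penalty
    return E_tilde, grad

rng = np.random.default_rng(2)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
w = rng.normal(size=3)
_, g = regularized_loss_grad(w, X, y)

# Finite-difference check of the regularized gradient (first coordinate)
h = 1e-6
e = np.zeros(3); e[0] = h
num = (regularized_loss_grad(w + e, X, y)[0]
       - regularized_loss_grad(w - e, X, y)[0]) / (2 * h)
assert abs(num - g[0]) < 1e-4
```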

Dropout

Dropout is one of the most popular regularization techniques for deep neural networks. It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability pp of being temporarily “dropped out,” meaning its output is set to zero during this training step and no gradients flow through it, but it may be active during the next training step. A dropped neuron therefore contributes to neither the forward pass nor backpropagation for that training step. This happens only during training: at inference time, all neurons are used normally. The hyperparameter pp is called the dropout rate, and it is typically set to 50%.

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones. Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong. Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.
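A sketch of dropout in NumPy, using the "inverted" variant in which survivors are rescaled by 1/(1−p) during training so that expected activations match those seen at inference:

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training,
    and scale the survivors by 1 / (1 - p) so the expected activation is
    unchanged. At inference time, return activations untouched."""
    if not training:
        return h                      # all neurons used normally at inference
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p   # keep a unit with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones((1000, 100))
out = dropout(h, p=0.5, rng=np.random.default_rng(0))
assert np.isclose(out.mean(), 1.0, atol=0.05)      # expectation preserved
assert np.isclose(dropout(h, training=False).mean(), 1.0)
```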

Early stopping

Often neural networks have too many weights and will overfit the data at the global minimum of the training error RR. In early developments of neural networks, either by design or by accident, an early stopping rule was used to avoid overfitting. The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data. For many of the optimization algorithms used for network training, such as conjugate gradients, the error is a nonincreasing function of the iteration index. However, the error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation dataset in order to obtain a network having good generalization performance. The behaviour of the network in this case is sometimes explained qualitatively in terms of the effective number of degrees of freedom in the network: this number starts out small and then grows during the training process, corresponding to a steady increase in the effective complexity of the model.
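A generic early-stopping loop can be sketched as follows; the `step` and `val_error` callbacks are placeholders standing in for one epoch of real training and a real validation-set evaluation:

```python
import numpy as np

def train_with_early_stopping(step, val_error, max_epochs=100, patience=5):
    """Stop once validation error has not improved for `patience` consecutive
    epochs, and report the best epoch (whose weights one would restore)."""
    best_err, best_epoch, waited = np.inf, 0, 0
    for epoch in range(max_epochs):
        step(epoch)                   # one epoch of training (placeholder)
        err = val_error(epoch)        # validation error after this epoch
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                 # validation error stopped improving
    return best_epoch, best_err

# Toy validation curve: decreases, then increases as over-fitting sets in.
curve = lambda e: (e - 10) ** 2 + 3.0
best_epoch, best_err = train_with_early_stopping(lambda e: None, curve)
assert best_epoch == 10 and best_err == 3.0
```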

Faster Optimizers

Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training (and reach a better solution): applying a good initialization strategy for the connection weights, using a good activation function, using Batch Normalization, and reusing parts of a pretrained network (possibly built on an auxiliary task or using unsupervised learning). Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. Some popular ones are Adam, AdaGrad, RMSProp.

The learning rate matters enormously. If you set it way too high, training may actually diverge. If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never really settling down. If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution.

Learning Rate, Batch Size and other Hyperparameters

The number of hidden layers and neurons are not the only hyperparameters you can tweak in an MLP. Here are some of the most important ones, and some tips on how to set them:

The learning rate is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate. So a simple approach for tuning the learning rate is to start with a large value that makes the training algorithm diverge, then divide this value by 3 and try again, and repeat until the training algorithm stops diverging. The batch size can also have a significant impact on your model’s performance and the training time. In general the optimal batch size will be lower than 32. A small batch size ensures that each training iteration is very fast, and although a large batch size will give a more precise estimate of the gradients, in practice this does not matter much since the optimization landscape is quite complex and the true gradients do not point precisely toward the optimum. However, having a batch size greater than 10 helps take advantage of hardware and software optimizations, in particular for matrix multiplications, so it will speed up training. Moreover, if you use Batch Normalization, the batch size should not be too small. In general, the ReLU activation function will be a good default for all hidden layers. For the output layer, it really depends on your task. In most cases, the number of training iterations does not actually need to be tweaked: just use early stopping instead.

Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then just reuse the lower layers of this network: this is called transfer learning. It will not only speed up training considerably, but will also require much less training data.

The output layer of the original model should usually be replaced since it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task. Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse.

Try freezing all the reused layers first (i.e., make their weights non-trainable, so gradient descent won’t modify them), then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze. It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights.

If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freezing all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even add more hidden layers. Note that transfer learning does not work very well with small dense networks: it works best with deep convolutional neural networks, so we will revisit transfer learning in that context.

Mixture Density Networks (Optional)

The goal of supervised learning is to model a conditional distribution p(tx)p(\bm t \mid \bm x), which for many simple regression problems is chosen to be Gaussian. However, practical machine learning problems can often have significantly non-Gaussian distributions. These can arise, for example, with inverse problems in which the distribution can be multimodal, in which case the Gaussian assumption can lead to very poor predictions.

As demonstration, data for this problem is generated by sampling a variable x\bm x uniformly over the interval (0,1)(0, 1), to give a set of values {xn}\{x_n \}, and the corresponding target values tnt_n are obtained by computing the function xn+0.3sin(2πxn)x_n + 0.3 \sin(2πx_n) and then adding uniform noise over the interval (0.1,0.1)(−0.1,0.1). The inverse problem is then obtained by keeping the same data points but exchanging the roles of xx and tt.

drawing

Least squares corresponds to maximum likelihood under a Gaussian assumption. We see that this leads to a very poor model for the highly non-Gaussian inverse problem. We therefore seek a general framework for modelling conditional probability distributions. This can be achieved by using a mixture model for p(tx)p(\bm t\mid \bm x) in which both the mixing coefficients as well as the component densities are flexible functions of the input vector xx, giving rise to the mixture density network. For any given value of x\bm x, the mixture model provides a general formalism for modelling an arbitrary conditional density function p(tx)p(\bm t \mid \bm x). Here we shall develop the model explicitly for Gaussian components, so that
p(tx)=k=1Kp(tx,ck)p(ckx)=k=1Kπk(x)N(tμk(x),σk2(x))p(\bm t\mid \bm x) = \sum_{k=1}^K p(\bm t \mid \bm x, c_k)p(c_k \mid \bm x) = \sum_{k=1}^K π_k(\bm x) \mathcal N (\bm t \mid \bm \mu_k (\bm x), σ^2_k(\bm x))

where ckc_k is component kk. This is an example of a heteroscedastic model since the noise variance on the data is a function of the input vector xx. Instead of Gaussians, we can use other distributions for the components, such as Bernoulli distributions if the target variables are binary rather than continuous. We have also specialized to the case of isotropic covariances for the components, although the mixture density network can readily be extended to allow for general covariance matrices by representing the covariances using a Cholesky factorization. We now take the various parameters of the mixture model, namely the mixing coefficients πk(x)π_k(\bm x), the means µk(x)µ_k(\bm x), and the variances σk2(x)σ^2_k(\bm x), to be governed by the outputs of a conventional neural network that takes x\bm x as its input.

If there are KK components in the mixture model, and if t\bm t has LL components, then the network will have KK output unit activations denoted by akπa^π_k that determine the mixing coefficients πk(x)π_k(\bm x), KK outputs denoted by akσa^σ_k that determine the kernel widths σk(x)σ_k(\bm x), and K × L outputs denoted by akjµa^µ_{kj} that determine the components µkj(x)µ_{kj}(\bm x) of the kernel centres µk(x)µ_k(\bm x). The total number of network outputs is given by (L + 2)K, as compared with the usual LL outputs for a network which simply predicts the conditional means of the target variables. The mixing coefficients must satisfy the constraints

k=1Kπk(x)=1\sum_{k=1}^K \pi_k(\bm x) = 1

which can be achieved using a set of softmax outputs:

πk(x)=eakπl=1Kealπ\pi_k(\bm x) = \frac{e^{a^{\pi}_k}}{\sum_{l=1}^K e^{a^{\pi}_l}}

Similarly, the variances must satisfy σk2(x)0σ^2_k(\bm x) \geq 0 and so can be represented in terms of the exponentials of the corresponding network activations using σk(x)=eakσ\sigma_k(\bm x) = e^{a^{\sigma}_k}. Because the means µk(x)\bm µ_k(\bm x) have real components, they can be represented directly by the network output activations μkj(x)=akjμ\mu_{kj}(\bm x) = a^{\mu}_{kj}. The adaptive parameters of the mixture density network comprise the vector w\bm w of weights and biases in the neural network, that can be set by maximum likelihood, or equivalently by minimizing an error function defined to be the negative logarithm of the likelihood. For independent data, this error function takes the form

E(\bm w) =− \sum_{n=1}^N \ln \Big( \sum_{k=1}^K \pi_k(\bm x_n, \bm w) \mathcal N (\bm t_n \mid \bm \mu_k(\bm x_n, \bm w), \sigma^2_k(\bm x_n, \bm w)) \Big)

where we have made the dependencies on w\bm w explicit.
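The output-layer transformations and the resulting negative log-likelihood can be sketched for a 1D Gaussian mixture with made-up activation values (in a real mixture density network these activations would come from the trained net):

```python
import numpy as np

def mdn_params(a_pi, a_sigma, a_mu):
    """Map raw network activations to valid mixture parameters:
    softmax for the mixing coefficients, exp for the positive widths,
    identity for the (real-valued) means."""
    pi = np.exp(a_pi - a_pi.max())      # shifted for numerical stability
    pi /= pi.sum()
    sigma = np.exp(a_sigma)             # guarantees sigma_k > 0
    mu = a_mu
    return pi, sigma, mu

def mdn_nll(t, pi, sigma, mu):
    """Negative log-likelihood of scalar target t under the mixture."""
    comp = pi * np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum())

a_pi = np.array([0.2, -1.0, 0.5])       # made-up activations, K = 3
pi, sigma, mu = mdn_params(a_pi, np.array([0.0, -0.3, 0.1]),
                           np.array([0.0, 0.5, 1.0]))
assert np.isclose(pi.sum(), 1.0) and np.all(sigma > 0)

# Conditional mean E[t | x] = sum_k pi_k mu_k (see below)
cond_mean = np.sum(pi * mu)
```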

drawing

(a) Plot of the mixing coefficients πk(x)π_k(\bm x) as a function of x\bm x for the three kernel functions in a mixture density network trained on the data. The model has three Gaussian components, and uses a two-layer multi-layer perceptron with five ‘tanh’ sigmoidal units in the hidden layer, and nine outputs (corresponding to the 3 means, the 3 variances of the Gaussian components, and the 3 mixing coefficients). At both small and large values of x\bm x, where the conditional probability density of the target data is unimodal, only one of the kernels has a high value for its prior probability, while at intermediate values of x\bm x, where the conditional density is trimodal, the three mixing coefficients have comparable values. (b) Plots of the means μk(x)\bm \mu_k(\bm x) using the same colour coding as for the mixing coefficients. (c) Plot of the contours of the corresponding conditional probability density of the target data for the same mixture density network. (d) Plot of the approximate conditional mode, shown by the red points, of the conditional density.

Once a mixture density network has been trained, it can predict the conditional density function of the target data for any given value of the input vector. This conditional density represents a complete description of the generator of the data, so far as the problem of predicting the value of the output vector is concerned. One of the simplest of these is the mean, corresponding to the conditional average of the target data, and is given by

E[tx]=tp(tx)dt=k=1Kπk(x)μk(x)\mathbb E [\bm t\mid \bm x] = \int \bm t p(\bm t \mid \bm x) d\bm t= \sum_{k=1}^K π_k(\bm x) \bm \mu_k(\bm x)

We can similarly evaluate the variance of the density function about the conditional average, \mathbb E [ \| \bm t− \mathbb E[\bm t\mid \bm x] \|^2 \mid \bm x]; see p. 277 in Pattern Recognition and Machine Learning.

Convolutional Neural Networks

Convolutional Neural Networks (LeCun 1989), also called ConvNets or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels. Most commonly, ConvNet architectures make the explicit assumption that the inputs are images. The name convolutional neural network indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers. Convolution of two functions is defined as:

(xw)(t)=x(s)w(ts)ds(x*w)(t) = \int_{-\infty}^{\infty} x(s)w(t-s) ds

Convolution naturally appears in different areas of mathematics, such as probability theory. For example, the pdf of the sum of two independent random variables is the convolution of their pdfs. If xx and ww are defined only on integer tt, we can define the discrete convolution:

(xw)(t)=s=x(s)w(ts)(x*w)(t) = \sum_{s = -\infty}^{\infty} x(s)w(t-s)

The first argument xx to the convolution is often referred to as the input and the second ww argument as the filter (or kernel). The output is sometimes referred to as the feature map. In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. We often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel:
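The discrete convolution formula above can be checked against NumPy's np.convolve, which implements the same flipped-kernel definition:

```python
import numpy as np

def conv1d(x, w):
    """Discrete convolution (x*w)(t) = sum_s x(s) w(t-s) for finite signals;
    the output has length len(x) + len(w) - 1 (a 'full' convolution)."""
    T = len(x) + len(w) - 1
    out = np.zeros(T)
    for t in range(T):
        for s in range(len(x)):
            if 0 <= t - s < len(w):      # w is zero outside its stored range
                out[t] += x[s] * w[t - s]
    return out

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 1.0, 0.5])
# np.convolve uses the same flipped-kernel definition of convolution.
assert np.allclose(conv1d(x, w), np.convolve(x, w))
```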

(IK)(i,j)=mnI(m,n)K(i+m,j+n).(I*K)(i,j) = \sum_m\sum_n I(m,n) K(i+m, j+n).

When the kernel is flipped (i.e., using K(i−m, j−n)), convolution is commutative:

IK=KI.I*K = K*I.

In the formula above, however, the kernel is not flipped (+ rather than −); this operation is really cross-correlation, which is the common way to implement convolution in machine learning, and it is not commutative. This rarely matters in practice, because convolution is seldom used alone in machine learning; instead it is used simultaneously with other functions, and the combination of these functions does not commute regardless of whether the convolution operation flips its kernel or not. Discrete convolution can be viewed as multiplication by a matrix. However, the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This usually corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero) because the kernel is usually much smaller than the input image.

drawing

Convolution leverages three important ideas that can help improve a machine learning system:

Efficiency of Edge Detection

The image below on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left.

drawing

This shows the strength of all of the vertically oriented edges in the input image, which can be a useful operation for object detection. Suppose you have a simple image of a square split down the middle into a white (left) and a gray (right) area. Convolve this matrix with a filter that turns out to be a vertical edge detector. The following matrix computation is a representation of this:

[101000101000101000101000][1111]=[020002000200]\begin{bmatrix} 10&10&0&0\\ 10&10&0&0\\ 10&10&0&0\\ 10&10&0&0 \end{bmatrix} * \begin{bmatrix} 1&-1\\ 1&-1 \end{bmatrix}= \begin{bmatrix} 0&20&0\\ 0&20&0\\ 0&20&0 \end{bmatrix}

which activates the middle pixels vertically in the image as the border between white and gray. If the input image is 320 × 280, then the output image will have dimension 319 × 279. To describe the same transformation with a matrix multiplication in a fully connected layer, we would need 320 × 280 × 319 × 279, or about eight billion, entries in the matrix, while the filter has only four parameters: convolution is about two billion times more efficient at representing this transformation, a huge computational gain. Convolution with a single kernel can only extract one kind of feature, albeit at many spatial locations. Usually we want each layer of our network to extract many kinds of features, at many locations. So we use several filters to capture more information.
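This example can be verified by implementing the (no-flip) 2D formula from above directly:

```python
import numpy as np

def corr2d(I, K):
    """2D cross-correlation, the 'no flip' convolution used in ML:
    out(i, j) = sum_{m,n} I(i+m, j+n) K(m, n), with 'valid' output size."""
    H = I.shape[0] - K.shape[0] + 1
    W = I.shape[1] - K.shape[1] + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(I[i:i + K.shape[0], j:j + K.shape[1]] * K)
    return out

I = np.array([[10, 10, 0, 0]] * 4, dtype=float)   # white left, gray right
K = np.array([[1.0, -1.0], [1.0, -1.0]])          # vertical edge detector
out = corr2d(I, K)
assert np.allclose(out, [[0, 20, 0]] * 3)   # edge activated in middle column
```

The 4 × 4 input with a 2 × 2 filter produces a 3 × 3 output, matching the general (H − h + 1) × (W − w + 1) rule that gives 319 × 279 for the 320 × 280 image.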

Pooling

A typical layer of a convolutional network consists of three stages. In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the layer further. A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.

drawing

For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Max pooling introduces invariance. The image below demonstrates a view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. The lower figure is a view of the same network after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling units are only sensitive to the maximum value in the neighborhood, not its exact location.

drawing

Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to.

drawing

A pooling unit that pools over multiple features that are learned with separate filters can learn to be invariant to transformations of the input. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. Max pooling over spatial positions is naturally invariant to translation; this multi-channel approach is only necessary for learning other transformations.

Pooling is also used for downsampling. Here we use max-pooling with a pool width of 3 and a stride between pools of 2. This reduces the representation size by a factor of 2, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size, but must be included if we do not want to ignore some of the detector units.

drawing
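A minimal sketch of this downsampling variant (pool width 3, stride 2, keeping the smaller rightmost region), with hypothetical detector values:

```python
import numpy as np

def max_pool_downsample(x, width=3, stride=2):
    """Max pool with stride > 1; the rightmost window may be smaller
    but is still included, so no detector unit is ignored."""
    return np.array([x[s:s + width].max() for s in range(0, len(x), stride)])

detector = np.array([1.0, 0.2, 0.1, 0.3, 0.5, 0.4])
pooled = max_pool_downsample(detector)   # representation size halved: 6 -> 3
```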

In all cases, pooling helps to make the representation become approximately invariant to small translations of the input: if we translate the input by a small amount, the values of most of the pooled outputs do not change. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. When determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of the face and an eye on the right side of the face. In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

Other popular pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel. Pooling can complicate some kinds of neural network architectures that use top-down information, such as Boltzmann machines and autoencoders. Convolution and pooling can cause underfitting. If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs).

Hyperparameters

Three hyperparameters control the size of the output volume of a convolutional layer:

We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
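The formula can be wrapped in a small helper to check the examples above:

```python
def conv_output_size(W, F, S=1, P=0):
    """Number of neurons that 'fit': (W - F + 2P)/S + 1."""
    n, rem = divmod(W - F + 2 * P, S)
    assert rem == 0, "hyperparameters do not tile the input evenly"
    return n + 1

# examples from the text
s1 = conv_output_size(7, 3, S=1, P=0)   # 7x7 input, 3x3 filter, stride 1 -> 5
s2 = conv_output_size(7, 3, S=2, P=0)   # same but stride 2 -> 3
s3 = conv_output_size(11, 5, S=2, P=0)  # the (11,11,4) example below -> 4
```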

Suppose that the input volume X has shape X.shape: (11,11,4). Suppose further that we use no zero padding (P=0), that the filter size is F=5, and that the stride is S=2. The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it V), would then look as follows (only some of the elements are computed in this example):

V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0

Remember that in numpy, the operation * above denotes elementwise multiplication between the arrays. Notice also that W0 is the weight vector of that neuron and b0 is its bias. Here, W0 is assumed to be of shape W0.shape: (5,5,4), since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we are using the same weights and bias (due to parameter sharing), and the indices along the width increase in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have:

V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1
V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1 (example of going along y)
V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1 (or along both)
...

where we see that we are indexing into the second depth dimension in V (at index 1) because we are computing the second activation map, and that a different set of parameters (W1) is now used. In the example above, we are for brevity leaving out some of the other operations the Conv Layer would perform to fill the other parts of the output array V. Additionally, recall that these activation maps are often passed elementwise through an activation function such as ReLU, but this is not shown here. To summarize, the Conv Layer:

In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias. A common setting of the hyperparameters is F=3, S=1, P=1. However, there are common conventions and rules of thumb that motivate these hyperparameters.
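The indexing walked through above can be made fully runnable. This is an illustrative sketch with random data and two filters; the names X, V, W0, W1, b0, b1 follow the text, everything else (values, seed) is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((11, 11, 4))        # input volume, depth 4
W0, W1 = rng.standard_normal((2, 5, 5, 4))  # two 5x5x4 filters
b0, b1 = 0.1, -0.2
S = 2                                       # stride, no zero padding

out = (11 - 5) // S + 1                     # spatial output size = 4
V = np.zeros((out, out, 2))                 # one depth slice per filter
for d, (Wd, bd) in enumerate([(W0, b0), (W1, b1)]):
    for i in range(out):
        for j in range(out):
            patch = X[i * S:i * S + 5, j * S:j * S + 5, :]
            V[i, j, d] = np.sum(patch * Wd) + bd   # shared weights + bias
```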

Backpropagation

The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation. This way, the max pooling layer adds almost no extra cost to backpropagation.
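A minimal sketch of the switch mechanism, using a non-overlapping 1D max pool for simplicity (real layers pool 2D windows, but the routing idea is the same):

```python
import numpy as np

def maxpool_forward(x, width=2):
    """Non-overlapping 1D max pool; remember the argmax 'switches'."""
    x = x.reshape(-1, width)
    switches = x.argmax(axis=1)
    return x.max(axis=1), switches

def maxpool_backward(grad_out, switches, width=2):
    """Route each upstream gradient to the input that was the max."""
    grad_in = np.zeros((len(grad_out), width))
    grad_in[np.arange(len(grad_out)), switches] = grad_out
    return grad_in.ravel()

x = np.array([1.0, 3.0, 2.0, 0.5])
y, sw = maxpool_forward(x)                     # maxima and their positions
gx = maxpool_backward(np.array([10.0, 20.0]), sw)
```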

Converting FC layers to CONV layers

It is worth noting that the only difference between FC (fully connected) and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it’s possible to convert between FC and CONV layers:

FC -> CONV conversion is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally the last FC layer with 1000 neurons that computes the class scores. We can convert each of these three FC layers to CONV layers as described above:

It turns out that this conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass. For example, if a 224x224 input image gives a volume of size [7x7x512] (i.e. a reduction by 32), then forwarding an input image of size 384x384 through the converted architecture would give the equivalent volume of size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers (4096 filters of size 7, then 4096 filters of size 1, then 1000 filters of size 1) would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image. Here is the benefit:

Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time but the second option is much more efficient.

Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores. Lastly, what if we wanted to efficiently apply the original ConvNet over the image but at a stride smaller than 32 pixels? We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.

ConvNet Architectures

We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (Max pool unless stated otherwise) and FC (fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity.

Layer Patterns

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, there may be a single CONV layer between every POOL layer:

INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC

Or, two CONV layers may be stacked before every POOL layer:

INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC

Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.

Prefer a stack of small filter CONV to one large receptive field CONV layer. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have C channels, then it can be seen that the single 7x7 CONV layer would contain C×(7×7×C) = 49C^2 parameters, while the three 3x3 CONV layers would only contain 3×(C×(3×3×C)) = 27C^2 parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.
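The parameter-count comparison is easy to verify (biases ignored; C = 64 is a made-up channel count):

```python
def conv_params(C_in, C_out, F):
    """Weights of one CONV layer with FxF filters (biases ignored)."""
    return C_out * (F * F * C_in)

C = 64
single_7x7 = conv_params(C, C, 7)     # one 7x7 layer: 49 C^2 weights
three_3x3 = 3 * conv_params(C, C, 3)  # three stacked 3x3 layers: 27 C^2
```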

In practice

Use whatever works best on ImageNet. In 90% or more of applications you should not have to worry about these. Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch.

Layer Sizing Patterns

We will first state the common rules of thumb for sizing the architectures and then follow the rules with a discussion of the notation:

The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don’t zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters “work out”, and that the ConvNet architecture is nicely and symmetrically wired. In general:

Case studies

There are several architectures in the field of Convolutional Networks that have a name. The most common are:

drawing

This is called a shortcut because layer $\ell+1$ is totally skipped (skip connection). The authors realized that stacking residual blocks allows training much deeper nets. This trick strengthens the backprop gradient signal, so convergence is faster and the loss keeps decreasing over many more iterations.

Computational Considerations

The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of:

Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn’t fit, a common heuristic to “make it fit” is to decrease the batch size, since most of the memory is usually consumed by the activations.
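The unit conversion described above as a small helper (the 200M-value workload is a made-up example):

```python
def values_to_gb(n_values, bytes_per_value=4):
    """float32 -> 4 bytes per value (use 8 for double precision);
    divide by 1024 three times to pass through KB, MB, GB."""
    return n_values * bytes_per_value / 1024 / 1024 / 1024

# e.g. a hypothetical 100M activations + 100M gradients in float32
total_gb = values_to_gb(200_000_000)   # ~0.75 GB
```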

Transfer Learning

In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest. The three major Transfer Learning scenarios look as follows:

  1. ConvNet as fixed feature extractor:

    • Take a ConvNet pretrained on ImageNet
    • Remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet)
    • Treat the rest of the ConvNet as a fixed feature extractor for the new dataset

    In an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. We call these features CNN codes. It is important for performance that these codes are ReLUd (i.e. thresholded at zero) if they were also thresholded during the training of the ConvNet on ImageNet (as is usually the case). Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.

  2. Fine-tuning the ConvNet: The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset. In case of ImageNet for example, which contains many dog breeds, a significant portion of the representational power of the ConvNet may be devoted to features that are specific to differentiating between dog breeds.

  3. Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release their final ConvNet checkpoints for the benefit of others who can use the networks for fine-tuning. For example, the Caffe library has a Model Zoo where people share their network weights.

When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of several factors, but the two most important ones are the size of the new dataset (small or big), and its similarity to the original dataset (e.g. ImageNet-like in terms of the content of images and the classes, or very different, such as microscope images). Keeping in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers, here are some common rules of thumb for navigating the 4 major scenarios:

Practical Advice

There are a few additional things to keep in mind when performing Transfer Learning:



Unsupervised Learning: PCA, K-Means, GMM

Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality.

Curse of Dimensionality

It turns out that many things behave very differently in high-dimensional space. For example, if you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension). But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border.
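The quoted probabilities follow from $1 - (1 - 2\cdot\text{margin})^d$ (the point must avoid both ends of every axis), which a few lines verify:

```python
def p_near_border(d, margin=0.001):
    """P(a uniform point in the unit d-cube is within `margin`
    of the border along at least one axis) = 1 - (1 - 2*margin)^d."""
    return 1 - (1 - 2 * margin) ** d

p_square = p_near_border(2)        # ~0.4% for the unit square
p_hyper = p_near_border(10_000)    # > 99.999999% in 10,000 dimensions
```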

In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.

An increase in the dimensions means an increase in the number of features. To model such data, we need to increase the complexity of the model by increasing the number of parameters. The complexity of functions of many variables can grow exponentially with the dimension, and if we wish to be able to estimate such functions with the same accuracy as functions in low dimensions, then we need the size of our training set to grow exponentially as well.

As another simple example, consider a sphere of radius $r = 1$ in a space of $D$ dimensions, and ask what fraction of the volume of the sphere lies between radius $r = 1-\epsilon$ and $r = 1$. We can evaluate this fraction by noting that the volume of a sphere of radius $r$ in $D$ dimensions must scale as $r^D$, and so we write $V_D(r) = K_D r^D$ where $K_D$ depends only on $D$. Then

$$\frac{V_D(1)-V_D(1-\epsilon)}{V_D(1)} = 1 - (1-\epsilon)^D$$

which tends to 1 as $D$ increases. Thus, in spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface! Another similar example: in a high-dimensional space, most of the probability mass of a Gaussian is located within a thin shell at a specific radius. Similarly, most of the density for a multivariate unit uniform distribution is concentrated near the sides of the unit box. This leads to sparse sampling in high dimensions, meaning all sample points are close to an edge of the sample space.

As one more example, consider the nearest-neighbor procedure for inputs uniformly distributed in a $d$-dimensional unit hypercube. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction $r$ of the observations. Since this corresponds to a fraction $r$ of the unit volume, the expected edge length will be $e_d(r) = r^{1/d}$. In ten dimensions $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local”. Reducing $r$ dramatically does not help much either, since the fewer observations we average, the higher the variance of our fit.
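Both of these formulas are one-liners, so the quoted numbers are easy to check:

```python
def shell_fraction(D, eps):
    """Fraction of a D-ball's volume within eps of the surface."""
    return 1 - (1 - eps) ** D

def edge_length(d, r):
    """Expected edge of a hypercube neighborhood capturing fraction r."""
    return r ** (1 / d)

f = shell_fraction(500, 0.01)   # nearly all volume is in the thin shell
e1 = edge_length(10, 0.01)      # ~0.63: 63% of each axis for 1% of the data
e2 = edge_length(10, 0.10)      # ~0.80
```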

Although the curse of dimensionality certainly raises important issues for pattern recognition applications, it does not prevent us from finding effective techniques applicable to high-dimensional spaces:

Why Reduce Dimensionality?

Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. For example in image data, two neighboring pixels are often highly correlated: if you merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), you will not lose much information!

Reducing dimensionality does lose some information (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may also make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain. So you should first try to train your system with the original data before considering using dimensionality reduction if training is too slow. In some cases, however, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance (but in general it won’t; it will just speed up training). Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization (or DataViz). Reducing the number of dimensions down to two (or three) makes it possible to plot a condensed view of a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters. Moreover, DataViz is essential to communicate your conclusions to people who are not data scientists, in particular decision makers who will use your results.

In dimensionality reduction, we try to learn a mapping to a lower dimensional space that preserves as much information as possible about the input. Dimensionality reduction techniques can save computation/memory, reduce overfitting, help visualize in 2 dimensions.

Main Approaches for Dimensionality Reduction

Linear Dimensionality Reduction (PCA)

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it. PCA identifies the axes called Principal Components that account for the largest amount of variance in the training set. PCA is defined as an orthogonal linear transformation on the feature vectors of a dataset that transforms the data to a new coordinate system such that the transformed feature vectors expand in the directions of the greatest variances and they are uncorrelated. So how can you find the principal components of a training set? Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) which we will discuss shortly.

Suppose $X$ is an $n\times p$ data matrix of $n$ samples with $p$ features, with column-wise zero empirical mean per feature. Otherwise replace the rows $\bm x_i$ of $X$ with $\bm x_i - \bm \mu$ where $\bm \mu = \frac{1}{n}\sum_i \bm x_i$. We are looking for an orthonormal $p\times p$ matrix $W$ to change the basis of the space into a new basis representing the directions of maximum variance. The columns of $W$ are the unit basis vectors we are looking for. Note that the sample variance of the data along a unit vector $\bm w$ is $\frac{||X\bm w||^2}{n-1}$. So our first unit basis vector is obtained as follows:

$$\bm w_1 = \argmax_{\bm w} \frac{||X\bm w||^2}{||\bm w||^2} = \argmax_{\bm w} \frac{\bm w^T X^T X\bm w}{\bm w^T \bm w}$$

A standard result for a positive semidefinite matrix such as $X^TX$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when $\bm w$ is the corresponding eigenvector. So $\bm w_1$ is the eigenvector of $X^TX$ corresponding to the largest eigenvalue. To find the second maximum-variance unit basis vector, we repeat the same process on the new data matrix $X'$ whose rows are the rows of $X$ with their component along $\bm w_1$ subtracted off: $X' = X - X\bm w_1\bm w_1^T$, and so on until we find the $p$th vector $\bm w_p$. It turns out that each step gives the remaining eigenvectors of $X^TX$ in decreasing order of eigenvalues. These basis vectors, the eigenvectors of $X^TX$, are the principal components. The transformation $XW$ maps each data vector $\bm x_i$ from the original space to a new space of $p$ variables which are uncorrelated over the dataset.

A dimensionality reduction of the data $X$ is obtained by selecting the first few columns of $XW$, which represent the highest data variations in a smaller feature space of $<p$ dimensions. For example, keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out. If the data contains clusters, these too may be most spread out, and therefore most visible when plotted in a two-dimensional diagram; whereas if any two random directions through the data are chosen, the clusters may be much less spread apart, and may in fact substantially overlay each other, making them indistinguishable.

The explained variance ratio of each principal component indicates the proportion of the dataset’s variance that lies along the axis of each principal component. This can be used to choose the number of dimensions that add up to a sufficiently large portion of the variance, say 95%. There will usually be an elbow in the curve, where the explained variance stops growing fast. You can think of this as the intrinsic dimensionality of the dataset.

drawing
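The procedure above can be sketched in a few lines of NumPy: center the data, eigendecompose the (scaled) $X^TX$, sort eigenpairs by decreasing eigenvalue, and read off the explained variance ratio. The data here is synthetic, with most variance deliberately placed on one feature:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: 200 samples, 3 features, variance mostly along feature 0
X = rng.standard_normal((200, 3)) * np.array([5.0, 1.0, 0.1])
X = X - X.mean(axis=0)                    # center each feature

C = X.T @ X / (len(X) - 1)                # sample covariance matrix
eigvals, W = np.linalg.eigh(C)            # eigh returns ascending eigenvalues
eigvals, W = eigvals[::-1], W[:, ::-1]    # reorder: largest variance first

explained = eigvals / eigvals.sum()       # explained variance ratio
Z = X @ W[:, :2]                          # project onto the top 2 PCs
```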

Similarly, in regression analysis, the larger the number of explanatory variables allowed, the greater the chance of overfitting the model, producing conclusions that fail to generalize. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called principal component regression. In machine learning, the orthogonal projection of a data point $\bm x$ onto the subspace $\mathcal S$ spanned by a subset of principal components is the point $\bm{\tilde x} \in \mathcal S$ closest to $\bm x$ and is called the reconstruction of $\bm x$. Choosing a subspace to maximize the projected variance, or minimize the reconstruction error, is called principal component analysis (PCA).

PCA can be viewed from another angle. The matrix $\frac{1}{n-1}X^TX$ itself is the empirical sample covariance matrix of the dataset. By definition, given a sample consisting of $n$ independent observations $\bm x_1,\dots, \bm x_n$ of a multivariate random variable $\bm X$, an unbiased estimator of the $p\times p$ covariance matrix $\Sigma = \mathbb E[(\bm X-\mathbb E[\bm X])(\bm X-\mathbb E[\bm X])^T]$ is the sample covariance matrix

$$\frac{1}{n-1} \sum_{i=1}^n (\bm x_i - \bm \mu)(\bm x_i - \bm \mu)^T.$$

Note that every term in the above sum is a $p\times p$ matrix. In our context, this sum is exactly the matrix $\frac{1}{n-1}X^TX$ written in a compact way; recall that WLOG we assumed $\bm \mu = \bm 0$. Here the matrix product $A\times B$ is computed in an equivalent way: column $i$ of $A$ is matrix-multiplied by row $i$ of $B$ for every $i$ from 1 to $p$, then all these matrices are added to get $A\times B$.
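The equivalence of the outer-product sum and the compact matrix form can be checked directly on random centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
X = X - X.mean(axis=0)                 # center, so mu = 0
n = len(X)

# sum of p-by-p outer products, one per sample...
S_outer = sum(np.outer(x, x) for x in X) / (n - 1)
# ...equals the compact matrix expression
S_compact = X.T @ X / (n - 1)
```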

Spectral Decomposition:

A symmetric matrix $A$ has a full set of real eigenvectors, which can be chosen to be orthonormal. This gives a decomposition $A = Q\Lambda Q^T$, where $Q$ is orthonormal and $\Lambda$ is diagonal. The columns of $Q$ are eigenvectors, and the diagonal entries $\lambda_j$ of $\Lambda$ are the corresponding eigenvalues. I.e., symmetric matrices are diagonal in some basis. A symmetric matrix $A$ is positive semidefinite iff each $\lambda_j \ge 0$. Being a symmetric, positive semidefinite matrix, $X^TX$ is diagonalizable:

$$X^TX = W \Lambda W^T$$

where $\Lambda$ is the diagonal matrix of eigenvalues of $X^TX$. The columns of $W$ are eigenvectors of $X^TX$, which are also the principal components. Because the trace is invariant under a change of basis and the original diagonal entries of the covariance matrix $\frac{1}{n-1}X^TX$ are the variances of the features, the sum of the eigenvalues must equal the sum of the original variances. In other words, the cumulative proportion of the top $k$ eigenvalues is the "explained variance" of the first $k$ principal components.

Singular Value Decomposition

The spectral decomposition is a special case of the singular value decomposition, which states that any matrix $A_{m\times n}$ can be expressed as $A = U\Sigma V^T$ (or $U\Sigma V^*$ in the complex case) where $U_{m \times m}$ and $V_{n\times n}$ are orthogonal (unitary) matrices and $\Sigma_{m\times n}$ is a rectangular diagonal matrix. The principal components transformation can also be obtained from another matrix factorization, the singular value decomposition (SVD) of $X$,

$$X = U\Sigma W^T$$
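The link between the SVD and the spectral decomposition of $X^TX$ can be checked numerically: since $X = U\Sigma W^T$, we have $X^TX = W\Sigma^2 W^T$, so the squared singular values are exactly the eigenvalues of $X^TX$. A quick sketch on random centered data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)

U, s, Wt = np.linalg.svd(X, full_matrices=False)  # s sorted descending
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]       # largest eigenvalue first
```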

Limitation of PCA

Autoencoders (Advanced PCA) and Nonlinear Dimensionality Reduction

An autoencoder is a feed-forward neural net whose job is to take an input $\bm x$ and predict itself, $\bm x$. To make this non-trivial, we need to add a bottleneck layer whose dimension is much smaller than the input. Deep nonlinear autoencoders learn to project the data not onto a subspace, but onto a nonlinear manifold.

input (784 units) → 100 units → code vector (20 units) → 100 units → reconstruction (784 units)

The lower half of the architecture is called the encoder and the top half is called the decoder. These autoencoders have non-linear activation functions after every feed-forward layer, so they can learn more powerful codes for a given dimensionality, compared with linear autoencoders (PCA).

The loss function is naturally $||\bm x - \bm{\tilde x}||^2$, the sum of squared errors. It is a proven result that the linear autoencoder is equivalent to PCA. If you restrict the autoencoder to:

Then...

| Property | PCA | Autoencoder |
| --- | --- | --- |
| Projection type | Linear | Can be nonlinear |
| Reconstruction | Orthogonal projection | Learned mapping |
| Training | SVD | Gradient descent |
| Noise handling | Poor | Can use denoising autoencoders |
| Dimensionality | Fixed | Can use variational AEs, sparse AEs, etc. |
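To make the PCA equivalence concrete, here is a minimal numpy sketch of a linear autoencoder trained by gradient descent; since PCA gives the optimal rank-kk linear reconstruction, the autoencoder's error can approach but never beat it (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

# Rank-k PCA reconstruction error via SVD: optimal for any linear autoencoder
k = 2
U, S, Wt = np.linalg.svd(X, full_matrices=False)
X_pca = (X @ Wt[:k].T) @ Wt[:k]
pca_err = np.mean((X - X_pca) ** 2)

# Linear autoencoder: encoder E (5 -> 2), decoder D (2 -> 5), no activations
E = rng.normal(scale=0.1, size=(5, k))
D = rng.normal(scale=0.1, size=(k, 5))
lr = 1e-3
for _ in range(5000):
    Z = X @ E                  # code vectors
    R = Z @ D                  # reconstruction
    G = 2 * (R - X) / len(X)   # gradient of the squared error w.r.t. R
    E -= lr * X.T @ (G @ D.T)
    D -= lr * Z.T @ G
ae_err = np.mean((X @ E @ D - X) ** 2)
print(pca_err, ae_err)
```

The learned subspace need not have orthogonal basis vectors, but it spans (approximately) the same top-kk principal subspace.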

Unsupervised Learning Techniques: Clustering

Sometimes the data form clusters, where examples within a cluster are similar to each other, and examples in different clusters are dissimilar. Such a distribution is multimodal, since it has multiple modes, or regions of high probability mass. Grouping data points into clusters, with no labels, is called clustering. Clustering is used in a wide variety of applications, including:

K-Means

K-means is a famous hard clustering algorithm. Assume the data x1,,xN\bm x_1, \dots, \bm x_N live in a Euclidean space, xnRd\bm x_n ∈ \mathbb R^d, and belong to K classes (patterns), where data points from the same class are similar, i.e. close in Euclidean distance. How can we identify those classes (the data points that belong to each class)? K-means assumes there are K clusters, and each point is close to its cluster center (the mean of points in the cluster). If we knew the cluster assignments we could easily compute the means; if we knew the means we could easily compute the cluster assignments.

For each data point xnx_n, we introduce a corresponding set of binary indicator variables rnk{0,1}r_{nk} ∈ \{0, 1\}, where k=k = 1, . . . , K describing which of the K clusters the data point xnx_n is assigned to, so that if data point xnx_n is assigned to cluster k then rnk=1r_{nk} = 1, and rnj=0r_{nj} = 0 for jkj \ne k. We can then define an objective function,

J=n=1Nk=1Krnkxnμk2J=\sum_{n=1}^N\sum_{k=1}^K r_{nk} ||\bm x_n - \bm \mu_k||^2

which represents the sum of the squares of the distances of each data point to its assigned vector μk\bm \mu_k. Our goal is to find values for the rnkr_{nk} and the μk\bm \mu_k so as to minimize JJ. We can do this through an iterative procedure in which each iteration involves two successive steps:

First we choose some random initial values for the μk\bm \mu_k (better if it is one of the points in the set). Then in the first phase we minimize JJ with respect to the rnkr_{nk}, keeping the μk\bm \mu_k fixed. In the second phase we minimize JJ with respect to the μk\bm \mu_k, keeping rnkr_{nk} fixed.


This two-stage optimization is then repeated until convergence. We shall see that these two stages of updating rnkr_{nk} and updating μk\bm \mu_k correspond respectively to the E (expectation) and M (maximization) steps of the EM algorithm. Because JJ is a linear function of rnkr_{nk}, this optimization can be performed easily to give a closed form solution:

rnk={1if k=arg minjxnμj20otherwiser_{nk} = \begin{cases} 1 &\text{if $k = \argmin_j ∥\bm x_n− \bm \mu_j∥^2$} \\ 0 & \text{otherwise} \end{cases}

Now consider the optimization of the μk\bm \mu_k given rnkr_{nk} which is an easy solution:

μk=nrnkxnnrnk\bm \mu_k = \frac{\sum_n r_{nk} \bm x_n}{\sum_n r_{nk}}

The denominator in this expression is equal to the number of points assigned to cluster kk, so this update simply sets μk\bm \mu_k equal to the mean of the points assigned to that cluster. For this reason, the procedure is known as the K-means algorithm. K-Means can also be seen as a matrix factorization, like PCA:
\min_{R,M} || X- RM||^2

where RR is the cluster assignment matrix and MM holds the centroids. In K-means, each cluster forms a Voronoi cell: the region closest to that centroid. The decision boundaries between clusters are linear — K-Means assumes spherical, equally sized clusters in Euclidean space. The K-means algorithm is based on the use of squared Euclidean distance as the measure of dissimilarity between a data point and a prototype vector. Not only does this limit the type of data variables that can be considered (it would be inappropriate for cases where some or all of the variables represent categorical labels, for instance), but it can also make the determination of the cluster means nonrobust to outliers. We can generalize the K-means algorithm by introducing a more general dissimilarity measure between two vectors x\bm x and x\bm x'. K-means is sensitive to outliers, as they can shift the mean significantly. Use robust alternatives (e.g., K-Medoids) by choosing a more appropriate dissimilarity measure.
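The two alternating updates above can be sketched directly in numpy (a minimal illustration; real implementations add smarter initialization and multiple restarts):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid (update r_nk)
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its points (update mu_k)
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
mu, r = kmeans(X, K=2)
```

Each iteration can only decrease the objective JJ, so the loop terminates when the assignments stop changing.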

Whenever an assignment is changed, the sum of squared distances JJ of data points from their assigned cluster centers is reduced. The objective JJ is non-convex, so coordinate descent on JJ is not guaranteed to converge to the global minimum (finding it is NP-hard; Lloyd's algorithm gives a local optimum). Unfortunately, although the algorithm is guaranteed to converge, it may not converge to the right solution (i.e., it may get stuck at a local minimum): this depends on the centroid initialization.

drawing

We could try non-local split-and-merge moves: simultaneously merge two nearby clusters and split a big cluster into two. The general solution is to run the algorithm multiple times with different random initializations and keep the best solution. To select the number of cluster, you may use the elbow curve on k-means loss or the mean silhouette coefficient over all the instances. An instance’s silhouette coefficient is equal to

\frac{b \; - \; a}{\max(a, b)}

where aa is the mean distance to the other instances in the same cluster (the mean intra-cluster distance), and bb is the mean nearest-cluster distance, that is, the mean distance to the instances of the next closest cluster (defined as the one that minimizes bb, excluding the instance's own cluster). The silhouette coefficient can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters; a coefficient close to 0 means that it is close to a cluster boundary; and a coefficient close to -1 means that the instance may have been assigned to the wrong cluster. An even more informative visualization is obtained when you plot every instance's silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient. This is called a silhouette diagram:
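Using scikit-learn, the mean silhouette coefficient can be compared across several values of k (synthetic, well-separated blobs here, so the true k should win):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three tight blobs along the diagonal
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 3.0, 6.0)])

# Mean silhouette coefficient for several candidate values of k
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(scores)
```

`silhouette_score` averages the per-instance coefficients; `sklearn.metrics.silhouette_samples` returns the per-instance values needed for a silhouette diagram.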

drawing

As the figure above shows, when k=5, all clusters have similar sizes, so even though the overall silhouette score for k=4 is slightly greater than for k=5, it seems like a good idea to use k=5 to get clusters of similar sizes. K-Means does not behave very well when the clusters have varying sizes, different densities, or non-spherical shapes. For example, the following figure shows how K-Means clusters a dataset containing three ellipsoidal clusters of different dimensions, densities and orientations:

drawing

It is important to scale the input features before you run K-Means, or else the clusters may be very stretched, and K-Means will perform poorly. Scaling the features does not guarantee that all the clusters will be nice and spherical, but it generally improves things. You can also think of K-means as some sort of compression: every point is replaced by its cluster centroid.

Mixtures of Gaussians

A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size, density and orientation. It is a generative model, meaning you can actually sample new instances from it. There are several GMM variants: in the simplest variant, implemented in the GaussianMixture class, you must know in advance the number kk of Gaussian distributions.
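A minimal GaussianMixture example, illustrating the soft responsibilities and the generative sample method (the data here is synthetic, with two elongated clusters):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=200),
    rng.multivariate_normal([6, 0], [[1.0, -0.5], [-0.5, 1.0]], size=200),
])

gm = GaussianMixture(n_components=2, n_init=5, random_state=42).fit(X)
print(gm.weights_)             # mixing coefficients pi_k (roughly 0.5 each here)
print(gm.means_)               # one mean vector per component
labels = gm.predict(X)         # hard assignments via the posterior p(z | x)
proba = gm.predict_proba(X)    # soft responsibilities gamma(z_nk)
X_new, y_new = gm.sample(10)   # generative model: draw brand-new instances
```

Because the model is generative, `sample` draws a component according to the mixing coefficients and then a point from that component's Gaussian.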

We now turn to a formulation of Gaussian mixtures in terms of discrete latent variables. This will provide us with a deeper insight into this important distribution, and will also serve to motivate the expectation-maximization algorithm. Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form

p(x)=k=1KπkN(xμk,Σk)p(\bm x) = \sum_{k=1}^K π_k \mathcal{N}(\bm x \mid \bm {\mu}_k, \Sigma_k)

drawing

Let variable z\bm z having a 1-of-K representation in which a particular element zkz_k is equal to 1 and all other elements are equal to 0. We shall define the joint distribution p(x,z)p(\bm x, \bm z) in terms of a marginal distribution p(z)p(\bm z) and a conditional distribution p(xz)p(\bm x\mid \bm z). The marginal distribution over z\bm z is p(zk=1)=πkp(z_k = 1) = π_k where kπk=1\sum_k π_k = 1. Because z\bm z uses a 1-of-K representation (remember, one and only one zkz_k can be 1), we can also write this distribution in the form

p(z)=kπkzkp(\bm z) = \prod_k π_k^{z_k}

Also, we have the conditional probability:

p(xzk=1)=N(xμk,Σk)p(\bm x \mid z_k = 1) = \mathcal N (\bm x \mid \bm {\mu}_k, Σ_k)

The joint distribution is given by p(z)p(xz)p(\bm z)p(\bm x\mid \bm z), and the marginal distribution of x\bm x is then obtained by summing the joint distribution over all possible states of z\bm z to give

p(x)=zp(z)p(xz)=kπkN(xμk,Σk)p(\bm x) = \sum_{\bm z} p(\bm z)p(\bm x\mid \bm z) = \sum_k π_k \mathcal{N}(\bm x \mid \bm {\mu}_k, \Sigma_k)

For every observation xn\bm x_n, there is a corresponding latent variable zn\bm z_n. Another quantity that will play an important role is the conditional probability of p(zx)p(\bm z\mid \bm x), whose value can be found using Bayes’ theorem:

p(zk=1x)=πkN(xμk,Σk)jπjN(xμj,Σj)p(z_k=1 \mid x) = \frac{π_k \mathcal N(\bm x \mid \bm \mu_k, \Sigma_k)}{\sum_j π_j \mathcal{N}(\bm x \mid \bm {\mu}_j, \Sigma_j)}

Suppose we have a dataset of observations {x1,...,xN}\{\bm x_1, . . . , \bm x_N \}, and we wish to model this data using a mixture of Gaussians. We can represent this dataset as an N×DN \times D matrix XX in which the nth row is given by xnT\bm x_n^T. Similarly, the corresponding latent variables will be denoted by an N×KN × K matrix ZZ with rows znT\bm z_n^T. If we assume that the data points are drawn independently from the distribution, then we can express the Gaussian mixture model for this i.i.d. dataset. The log of the likelihood function is given by

lnp(Xπ,μ,Σ)=n=1Nlnk=1KπkN(xnµk,Σk).\ln p(X \mid \bm π, \bm \mu, Σ) = \sum_{n=1}^N \ln \sum_{k=1}^K π_k \mathcal N (x_n \mid \bm µ_k, Σ_k).

Maximizing the above log likelihood function turns out to be a more complex problem than for the case of a single Gaussian. The difficulty arises from the presence of the summation over k that appears inside the logarithm, so that the logarithm function no longer acts directly on the Gaussian. If we set the derivatives of the log likelihood to zero, we will no longer obtain a closed-form solution. This summation effect can also create a singularity during maximization, which occurs when a Gaussian collapses onto a single point. Assume Σj=σj2IΣ_j = σ^2_j\bm I and μj=xn\bm \mu_j = \bm x_n for some value nn. This data point contributes a term in the likelihood function of the form:

N(xnxn,σj2I)=1(2π)1/2σj\mathcal N (x_n \mid \bm x_n, σ^2_j\bm I) = \frac{1}{(2π)^{1/2}\sigma_j}

If σj0\sigma_j \rightarrow 0, then we see that this term goes to infinity and so the log likelihood function will also go to infinity. Thus the maximization of the log likelihood function is not a well-posed problem because such singularities will always be present. Recall that this problem did not arise in the case of a single Gaussian distribution: if a single Gaussian collapses onto a data point, it still contributes multiplicative factors to the likelihood function arising from the other data points, and these factors go to zero exponentially fast, giving an overall likelihood that goes to zero rather than infinity. However, once we have (at least) two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood. Now suppose the dataset were complete: by complete we mean that for each observation in XX, we are also told the corresponding value of the latent variable ZZ. We shall call {X,Z}\{X, Z\} the complete dataset, and we shall refer to the actual observed data XX as incomplete. Consider the problem of maximizing the likelihood for the complete dataset {X,Z}\{X, Z\}. This likelihood function takes the form

p(X,Zμ,Σ,π)=p(XZ,μ,Σ,π)p(Z)=n=1Nk=1KπkznkN(xnμk,Σk)znkp(X, Z \mid \mu, Σ, π) = p(X\mid Z, \mu, Σ, π) p(Z) = \prod_{n=1}^N \prod_{k=1}^K π_k^{z_{nk}} \mathcal N(\bm x_n \mid \bm \mu_k, \Sigma_k)^{z_{nk}}

where znkz_{nk} denotes the kth component of znz_n. Taking the logarithm, we obtain

lnp(X,Zμ,Σ,π)=n=1Nk=1Kznk(lnπk+lnN(xnμk,Σk))\ln p(X, Z \mid \mu, Σ, π) = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \Big( \ln π_k + \ln \mathcal N(\bm x_n \mid \bm \mu_k, \Sigma_k) \Big)

with constraint kπk=1\sum_k π_k = 1. The maximization with respect to a mean or a covariance is exactly as for a single Gaussian, except that it involves only the subset of data points that are ‘assigned’ to that component. For the maximization with respect to the mixing coefficients, again, this can be enforced using a Lagrange multiplier as before, and leads to the result

πk=1Nn=1Nznkπ_k = \frac{1}{N} \sum_{n=1}^N z_{nk}

So the complete-data log likelihood function can be maximized trivially in closed form. In practice, however, we do not have values for the latent variables. Our state of knowledge of the values of the latent variables in ZZ is given only by the posterior distribution p(ZX,θ)p(Z|X, θ), in this case p(ZX,μ,Σ,π)p(Z\mid X, \mu, Σ, π).

p(ZX,μ,Σ,π)p(XZ,μ,Σ,π)p(Z)=n=1Nk=1KπkznkN(xnμk,Σk)znkp(Z \mid X, \mu, Σ, π) \propto p(X\mid Z, \mu, Σ, π) p(Z) = \prod_{n=1}^N \prod_{k=1}^K π_k^{z_{nk}} \mathcal N(\bm x_n \mid \bm \mu_k, \Sigma_k)^{z_{nk}}

Because we cannot use the complete-data log likelihood lnp(X,Zμ,Σ,π)\ln p(X, Z \mid \mu, Σ, π), we consider instead its expected value under the posterior distribution of the latent variable to be maximized (according to EM algorithm):

EZ[lnp(X,Zμ,Σ,π)]=EZ[n=1Nk=1Kznk(lnπk+lnN(xnμk,Σk))]=n=1Nk=1KE[znk](lnπk+lnN(xnμk,Σk))=n=1Nk=1Kγ(znk)(lnπk+lnN(xnμk,Σk))\begin{align*} \mathbb E_Z[ \ln p(X, Z \mid \mu, Σ, π)] & = \mathbb E_Z\Big[ \sum_{n=1}^N \sum_{k=1}^K z_{nk} \Big( \ln π_k + \ln \mathcal N(\bm x_n \mid \bm \mu_k, \Sigma_k) \Big)\Big]\\ & = \sum_{n=1}^N \sum_{k=1}^K \mathbb E[z_{nk}] \Big( \ln π_k + \ln \mathcal N(\bm x_n \mid \bm \mu_k, \Sigma_k) \Big) \\ & = \sum_{n=1}^N \sum_{k=1}^K \mathbb \gamma(z_{nk}) \Big( \ln π_k + \ln \mathcal N(\bm x_n \mid \bm \mu_k, \Sigma_k) \Big) \end{align*}

Because

p(zk=1x)=E[znk]=πkN(xμk,Σk)jπjN(xμj,Σj)=γ(znk)p(z_k=1 \mid x) = \mathbb E[z_{nk}] = \frac{π_k \mathcal N(\bm x \mid \bm \mu_k, \Sigma_k)}{\sum_j π_j \mathcal{N}(\bm x \mid \bm {\mu}_j, \Sigma_j)} = \gamma(z_{nk})

according to Bayes' Theorem. According to EM algorithm, first we choose some initial values for the parameters μold\mu_{\text{old}}, Σold\Sigma_{\text{old}}, πoldπ_{\text{old}}, and use these to evaluate the responsibilities γ(znk)\gamma(z_{nk}) (the E step) from the previous equation. We then keep the responsibilities fixed and maximize the expectation mentioned above with respect to µkµ_k, ΣkΣ_k and πkπ_k (the M step). This leads to closed form solutions for μnew\mu_{\text{new}}, Σnew\Sigma_{\text{new}}, πnewπ_{\text{new}}:

\begin{align*} \bm \mu_{\text{new}}^k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) \bm x_n\\ \Sigma_{\text{new}}^k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) (\bm x_n - \bm \mu_{\text{new}}^k) (\bm x_n - \bm \mu_{\text{new}}^k)^T \\ π_{\text{new}}^k & = \frac{N_k}{N} \end{align*}

where Nk=n=1Nγ(znk)N_k = \sum_{n=1}^N \gamma(z_{nk}). Evaluate the log likelihood

lnp(Xπ,μ,Σ)=n=1Nlnk=1KπkN(xnµk,Σk).\ln p(X \mid \bm π, \bm \mu, Σ) = \sum_{n=1}^N \ln \sum_{k=1}^K π_k \mathcal N (x_n \mid \bm µ_k, Σ_k).

and check for convergence of either the parameters or the log likelihood.

drawing

If we knew the parameters θ={πk,µk,Σk}θ= \{π_k ,µ_k ,Σ_k \}, we could infer which component a data point x\bm x probably belongs to by inferring its latent variable zi\bm z_i. This is just posterior inference, which we do using Bayes’ Rule:

p(zkx)=p(zk)p(xzk)kp(zk)p(xzk)p(z_{k} \mid \bm x) = \frac{p(z_k)p(\bm x \mid z_k)}{\sum_k p(z_k)p(\bm x \mid z_k)}

Just like Naive Bayes, GDA (meaning LDA and QDA), etc. at test time.

We use EM for GMMs instead of gradient descent because the objective involves latent variables (unobserved cluster assignments), making direct optimization via gradient descent messy. EM is a coordinate-ascent method with closed-form updates, tailored for problems with hidden structure. Using gradient descent directly on

lnp(Xπ,μ,Σ)=n=1Nlnk=1KπkN(xnµk,Σk).\ln p(X \mid \bm π, \bm \mu, Σ) = \sum_{n=1}^N \ln \sum_{k=1}^K π_k \mathcal N (x_n \mid \bm µ_k, Σ_k).

yields a very messy gradient. Each point xn\bm x_n could have come from any of the K Gaussians, and we don't know which. There is no simple closed-form update for the mixture weights πkπ_k, which must also satisfy the constraint of summing to 1, and the covariance matrices must remain positive semi-definite, which makes the derivatives expensive and complicated to compute. The solution is also invariant to permutations of the parameters, so the problem is not a convex optimization, just like neural networks. EM solves this neatly by using posterior probabilities as soft assignments and turning the hard likelihood into an expected complete-data log-likelihood, which can be optimized in closed form.

Unfortunately, just like K-Means, EM can end up converging to poor solutions, so it needs to be run several times, keeping only the best solution. When there are many dimensions, or many clusters, or few instances, EM can struggle to converge to the optimal solution. You might need to reduce the difficulty of the task by limiting the number of parameters that the algorithm has to learn: one way to do this is to limit the range of shapes and orientations that the clusters can have. This can be achieved by imposing constraints on the covariance matrices.

The computational complexity of training a GaussianMixture model depends on the number of instances mm, the number of dimensions nn, the number of clusters kk, and the constraints on the covariance matrices. If covariance_type is "spherical" or "diag", it is O(kmn)\mathcal O(kmn), assuming the data has a clustering structure. If covariance_type is "tied" or "full", it is O(kmn2+kn3)\mathcal O(kmn^2 + kn^3), so it will not scale to large numbers of features.

Using a Gaussian mixture model for anomaly detection is quite simple: any instance located in a low-density region can be considered an anomaly. You must define what density threshold you want to use. Gaussian mixture models try to fit all the data, including the outliers, so if you have too many of them, this will bias the model’s view of “normality”: some outliers may wrongly be considered as normal. If this happens, you can try to fit the model once, use it to detect and remove the most extreme outliers, then fit the model again on the cleaned up dataset. Another approach is to use robust covariance estimation methods.
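A sketch of this density-based anomaly detection with score_samples; the 2% threshold and the synthetic data are illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 2))
X_outliers = rng.normal(10, 1, size=(10, 2))   # far from the bulk of the data
X = np.vstack([X_normal, X_outliers])

gm = GaussianMixture(n_components=1, random_state=42).fit(X)
log_dens = gm.score_samples(X)                 # per-instance log density
threshold = np.percentile(log_dens, 2)         # flag the lowest 2% of densities
anomalies = log_dens < threshold
```

Because the outliers are included in the fit, the estimated covariance is inflated; with few outliers this still leaves them in the low-density tail, but with many outliers the refit-after-cleanup strategy described above becomes necessary.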

The General EM Algorithm

Given a joint distribution p(X,Zθ)p(X, Z \mid θ) over observed variables XX and latent variables ZZ, governed by parameters θθ, the goal is to maximize the likelihood function p(Xθ)p(X\mid θ) with respect to θθ.

  1. Choose an initial setting for the parameters θold\theta_{\text{old}}.
  2. E step: Evaluate p(ZX,θold)p(Z\mid X, \theta_{\text{old}}).
  3. M step: Evaluate θnew\theta_{\text{new}} given by
    θnew=arg maxθQ(θ,θold)\theta_{\text{new}} = \argmax_{\theta} \mathcal Q(θ, \theta_{\text{old}})
    where Q(θ,θold)=Zp(ZX,θold)lnp(X,Zθ)\mathcal Q(θ, \theta_{\text{old}}) = \sum_Z p(Z\mid X, \theta_{\text{old}}) \ln p(X,Z\mid \theta)
  4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let θoldθnew\theta_{\text{old}} \leftarrow \theta_{\text{new}}
    and return to step 2.

Relation to K-means

Comparison of the K-means algorithm with the EM algorithm for Gaussian mixtures shows that there is a close similarity. Whereas the K-means algorithm performs a hard assignment of data points to clusters, in which each data point is associated uniquely with one cluster, the EM algorithm makes a soft assignment based on the posterior probabilities. In fact, we can derive the K-means algorithm as a particular limit of EM for Gaussian mixtures: EM behaves like a soft version of K-means, with fixed priors and covariances. GMM reduces to K-Means if all Gaussians have identical spherical covariances and assignments are hard. See Pattern Recognition and Machine Learning, p443.

Bayesian Gaussian Mixture Models

Rather than manually searching for the optimal number of clusters, it is possible to use instead the BayesianGaussianMixture class, which is capable of giving weights equal (or close) to zero to unnecessary clusters. Just set the number of clusters n_components to a value that you have good reason to believe is greater than the optimal number of clusters (this assumes some minimal knowledge about the problem at hand), and the algorithm will eliminate the unnecessary clusters automatically.

from sklearn.mixture import BayesianGaussianMixture
bgm = BayesianGaussianMixture(n_components=10, n_init=10, random_state=42)
bgm.fit(X)
np.round(bgm.weights_, 2)

array([0.4 , 0.21, 0.4 , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])

Perfect: the algorithm automatically detected that only 3 clusters are needed. In this model, the cluster parameters (including the weights, means and covariance matrices) are not treated as fixed model parameters anymore, but as latent random variables, like the cluster assignments.

Prior knowledge about the latent variables z\bm z can be encoded in a probability distribution p(z)p(\bm z) called the prior. For example, we may have a prior belief that the clusters are likely to be few (low concentration), or conversely, that they are more likely to be plentiful (high concentration). This can be adjusted using the weight_concentration_prior hyperparameter. However, the more data we have, the less the priors matter. In fact, to plot diagrams with such large differences, you must use very strong priors and little data.

The EM Algorithm: Why it Works

The expectation maximization algorithm, or EM algorithm, is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables (Dempster et al., 1977; McLachlan and Krishnan, 1997). The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables like our situation here. The set of all model parameters is denoted by θθ, and so the log likelihood function is given by

lnp(Xθ)=lnZp(X,Zθ)\ln p(X \mid \theta) = \ln \sum_Z p(X,Z \mid \theta)

A key observation is that the summation over the latent variables appears inside the logarithm. The presence of the sum prevents the logarithm from acting directly on the joint distribution, resulting in complicated expressions for the maximum likelihood solution.

Consider a probabilistic model in which we collectively denote all of the observed variables by X and all of the hidden variables by Z. The joint distribution p(X,Zθ)p( X, Z|θ) is governed by a set of parameters denoted θθ. Our goal is to maximize the likelihood function that is given by

p(Xθ)=Zp(X,Zθ).p(X\mid θ) = \sum_{Z} p(X, Z|θ).

Here we are assuming Z is discrete, although the discussion is identical if Z comprises continuous variables or a combination of discrete and continuous variables, with summation replaced by integration as appropriate. We shall suppose that direct optimization of p(Xθ)p(X|θ) is difficult, but that optimization of the complete-data likelihood function p(X,Zθ)p(X, Z|θ) is significantly easier. As mentioned before, in practice we are not given the complete dataset {X,Z}\{X, Z\}, but only the incomplete data XX. Our state of knowledge of the values of the latent variables in ZZ is given only by the posterior distribution p(ZX,θ)p(Z|X, θ). Because we cannot use the complete-data log likelihood, we consider instead its expected value under distribution of the latent variable, which corresponds (as we shall see) to the E step of the EM algorithm. Next we introduce a distribution q(Z)q(Z) defined over the latent variables.

lnp(Xθ)=Zq(Z)lnp(Xθ)=Zq(Z)lnp(X,Zθ)p(ZX,θ)=Zq(Z)lnp(X,Zθ)q(Z)p(ZX,θ)q(Z)=Zq(Z)lnp(X,Zθ)q(Z)Zq(Z)lnp(ZX,θ)q(Z)\begin{align*} \ln p(X\mid θ) & = \sum_Z q(Z) \ln p(X\mid θ) \\ & = \sum_Z q(Z) \ln \frac{p(X,Z\mid \theta)}{p(Z\mid X, \theta)} \\ & = \sum_Z q(Z) \ln \frac{\frac{p(X,Z\mid \theta)}{q(Z)}}{\frac{p(Z\mid X, \theta)}{q(Z)}} \\ & = \sum_Z q(Z) \ln \frac{p(X,Z\mid \theta)}{q(Z)} - \sum_Z q(Z)\ln \frac{p(Z\mid X, \theta)}{q(Z)} \end{align*}

The first term is named L(q,θ)\mathcal L(q, \theta) and the second term is KL(qp)KL(q \parallel p) is the Kullback-Leibler divergence between q(Z)q(Z) and the posterior distribution p(ZX,θ)p(Z|X, θ). So we obtain the decomposition:

lnp(Xθ)=L(q,θ)+KL(qp)\begin{equation*} \ln p(X\mid θ) = \mathcal L(q, \theta) + \text{KL}(q\parallel p) \tag{\ddag} \end{equation*}

Recall that the Kullback-Leibler divergence satisfies KL(qp)0KL(q \parallel p) \ge 0, with equality if, and only if, q(Z)=p(ZX,θ)q(Z) = p(Z \mid X, θ). It therefore follows from the above equation that L(q,θ)lnp(Xθ)    q,θ\mathcal L(q, θ) \le \ln p(X|θ) \;\; \forall q, \theta, in other words that L(q,θ)\mathcal L(q, \theta) is a lower bound on lnp(Xθ)\ln p(X|θ).

The EM algorithm is a two-stage iterative optimization technique for finding maximum likelihood solutions. We can use the above decomposition to define the EM algorithm and to demonstrate that it does indeed maximize the log likelihood. Suppose that the current value of the parameter vector is θoldθ_\text{old}.

drawing

Substitute q(Z)=p(ZX,θold)q(Z) = p(Z|X, θ_\text{old}) into definition of L(q,θ)\mathcal L(q, θ), we see that in the M step, the quantity that is being maximized is the expectation of the complete-data log likelihood

\mathcal L(q, θ) = \sum_Z p(Z\mid X, θ_\text{old}) \ln p(X,Z \mid \theta) - \sum_Z p(Z\mid X, θ_\text{old}) \ln p(Z\mid X, θ_\text{old})

where the second term is constant, as it is simply the negative entropy of the qq distribution and is therefore independent of θθ. Thus each EM step increases the value of a well-defined lower bound on the log likelihood, and a complete EM cycle changes the model parameters in such a way as to cause the log likelihood to increase (unless it is already at a maximum, in which case the parameters remain unchanged).

We can also use the EM algorithm to maximize the posterior distribution p(θX)p(θ \mid X) for models in which we have introduced a prior p(θ)p(θ) over the parameters. To see this, we note that as a function of θθ, we have p(θX)=p(θ,X)/p(X)p(θ \mid X) = p(θ, X)/p(X) and so

\begin{align*} \ln p(θ\mid X) &= \mathcal L(q, θ) + KL(q\parallel p) + \ln p(θ)− \ln p(X)\\ &\ge \mathcal L(q, θ) + \ln p(θ)− \ln p(X). \end{align*}

where lnp(X)\ln p(X) is a constant. We can again optimize the right-hand side alternately with respect to qq and θθ. The optimization with respect to qq gives rise to the same E-step equations as for the standard EM algorithm, because qq only appears in L(q,θ)\mathcal L(q, θ). The M-step equations are modified through the introduction of the prior term lnp(θ)\ln p(θ), which typically requires only a small modification to the standard maximum likelihood M-step equations.

Interpreting ML

Partial dependence plots (PDP) show the dependence between the objective function (target response) and a set of input features of interest, marginalizing over the values of all other input features (the ‘complement’ features). Intuitively, we can interpret the partial dependence as the expected target response as a function of the input features of interest.

Let XsX_s be the set of input features of interest (i.e. the features parameter). Assuming the features of interest XsX_s are independent of the complement features XcX_c, the partial dependence of the response ff at a point xsx_s is defined as:

\begin{align*} \mathbb E_{X_c}[f(x_s,X_c)] & = \int f(x_s, x_c) \; p_{X_c\mid X_s}(x_c \mid x_s)\,dx_c \\ & = \int f(x_s, x_c) \; p_{X_c}(x_c)\,dx_c\\ & \approx \frac{1}{n} \sum_{i=1}^n f(x_s, x^{(i)}_c) \end{align*}

where nn is the number of instances in the dataset and xc(i)x^{(i)}_c are the observed values of the complement features. Due to the limits of human perception, the size of the set of input features of interest must be small (usually, one or two), thus the input features of interest are usually chosen among the most important features.
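The Monte Carlo approximation above can be implemented directly (scikit-learn also provides sklearn.inspection.partial_dependence); the model and synthetic data here are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)
model = GradientBoostingRegressor(random_state=42).fit(X, y)

def manual_pdp(model, X, feature, grid):
    """Average prediction with the feature of interest clamped to each grid value."""
    values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v      # overwrite x_s; keep the complements x_c as observed
        values.append(model.predict(Xv).mean())
    return np.array(values)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pdp = manual_pdp(model, X, 0, grid)
```

Since the target is linear in feature 0 with slope 2, the resulting partial dependence curve should be roughly linear and increasing.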

The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled.
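For example, with scikit-learn's permutation_importance on synthetic data where only the first two features matter:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# Feature 0 is strongly predictive, feature 1 weakly, feature 2 is pure noise
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(random_state=42).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print(result.importances_mean)   # decrease in score when each feature is shuffled
```

Permutation importance is model-agnostic: it only needs predictions and a score, so the same call works for any fitted estimator.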

SHAPLEY Values

SHAP values are one of the most powerful and interpretable ways to understand how each feature affects an individual prediction. They’re based on solid math and are widely used in explainable ML. It is a concept from game theory - Nobel Prize in Economics 2012. In the ML context, the game is prediction of an instance, each player is a feature, coalitions are subsets of features, and the game payoff is the difference in predicted value for an instance and the mean prediction (i.e. null model with no features used in prediction). The Shapley value ϕi(v)\phi_i(v) is given by the formula:

\frac{1}{\text{number of players}}\sum_{\text{coalitions excluding $i$}} \frac{\text{marginal contribution of $i$ to coalition}}{\text{number of coalitions excluding $i$ of this size}}

As a simple example, suppose there are 3 features aa, bb and cc used in a regression problem. The figure below shows the possible coalitions, where the members are listed in the first line and the predicted value for the outcome using just that coalition in the model is shown in the second line.

drawing

Let's work out the Shapley value of feature aa. First we work out the weights:

drawing

The red arrows point to the coalitions where aa was added (and so made a contribution). To figure out the weights there are 2 rules:

Now we multiply the weights by the marginal contributions - the value minus the coalition without that feature. So we have Shapley values as follows:

ψa(v)=13(105100)+16(125120)+16(10090)+13(115130)ψb(v)=13(90100)+16(100105)+16(130120)+13(115125)ψc(v)=13(120100)+16(125105)+16(13090)+13(115100)\psi_a(v) = \frac{1}{3}(105-100) + \frac{1}{6}(125-120) + \frac{1}{6}(100-90) + \frac{1}{3}(115-130) \\ \psi_b(v) = \frac{1}{3}(90-100) + \frac{1}{6}(100-105) + \frac{1}{6}(130-120) + \frac{1}{3}(115-125) \\ \psi_c(v) = \frac{1}{3}(120-100) + \frac{1}{6}(125-105) + \frac{1}{6}(130-90) + \frac{1}{3}(115-100) \\

then,

ψa(v)=0.833ψb(v)=5.833ψc(v)=21.666\psi_a(v) = -0.833 \\ \psi_b(v) = -5.833\\ \psi_c(v) = 21.666\\

So ψa(v)+ψb(v)+ψc(v)=14.999\psi_a(v) + \psi_b(v) + \psi_c(v) = 14.999, which is (up to rounding) the difference between the full-model prediction (115) and the null-model prediction (100).
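The arithmetic above can be verified with a short script implementing the Shapley formula, using the coalition values from the figure:

```python
from itertools import combinations
from math import comb

# Coalition values from the example (the null model predicts the mean, 100)
v = {(): 100, ('a',): 105, ('b',): 90, ('c',): 120,
     ('a', 'b'): 100, ('a', 'c'): 125, ('b', 'c'): 130,
     ('a', 'b', 'c'): 115}
players = ('a', 'b', 'c')

def shapley(i):
    n = len(players)
    others = [p for p in players if p != i]
    total = 0.0
    for size in range(n):
        for S in combinations(others, size):
            with_i = tuple(sorted(S + (i,)))
            # weight = 1/n * 1/(number of coalitions of this size excluding i)
            total += (v[with_i] - v[tuple(sorted(S))]) / (n * comb(n - 1, size))
    return total

print({p: round(shapley(p), 3) for p in players})
```

The exact values are -5/6, -35/6 and 130/6, which sum to exactly 15, the full-model prediction minus the null-model prediction.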

SHAP

Shapley Additive exPlanations:

Note that the sum of Shapley values for all features is the same as the difference between the predicted value of the full model (with all features) and the null model (with no features, which is the default prediction for all instances). So the Shapley values provide an additive measure of feature importance. In fact, we have a simple linear model that explains the contributions of each feature.

You can just get an "importance" type bar graph, which shows the mean absolute Shapley value for each feature. You can also use max absolute Shapley value as well.

drawing

Use a Beeswarm Plot to summarize the entire distribution of SHAP values for each feature:

drawing

This shows the jittered Shapley values for all instances for each feature (Titanic dataset). An instance to the right of the vertical line (null model which predict the constant mean of target response for all inputs) means that the model predicts a higher probability of survival than the null model. If a point is on the right and is red, it means that for that instance, a high value of that feature predicts a higher probability of survival. For example, sex_female shows that if you have a high value (1 = female, 0 = male) your probability of survival is increased. Similarly, younger age predicts higher survival probability.

Force plot

We show the forces acting on a single instance (index 10). The model predicts a survival probability of 0.03, which is lower than the base probability of survival (0.39). The force plot shows which features account for the difference between the full-model and base-model predictions: blue arrows mean the feature decreases survival probability, red arrows mean it increases it.

(figure: SHAP force plot for instance 10)

MLOps: Machine Learning Pipelines in Production

Workspace Setup

Here is how to structure a workspace for a machine learning project optimized for production, covering the Python environment, data pipeline, preprocessing, and deployment readiness:

  ml_project/
│
├── data/                     # (optional local) raw & processed data
│   ├── raw/
│   └── processed/
│
├── notebooks/                # For exploratory analysis (EDA)
│
├── src/                      # Source code
│   ├── config/               # Config files or Hydra config scripts
│   ├── data_pipeline/        # Data ingestion + validation
│   ├── preprocessing/        # Transformations shared by training & inference
│   ├── models/               # Model training, saving, loading
│   ├── evaluation/           # Metrics & evaluation scripts
│   └── serving/              # Inference and deployment scripts (API, CLI)
│
├── scripts/                  # CLI tools to run various stages
│
├── tests/                    # Unit tests
│
├── Dockerfile                # Containerization
├── requirements.txt / pyproject.toml
├── .env                      # Secrets/config (never commit)
└── README.md

Python Environment

Use virtual environments (like venv) and declare dependencies in requirements.txt.

| Category | Libraries |
| --- | --- |
| Environment | pip, venv (default, built-in), poetry |
| Data | pandas, numpy, pyarrow, dask, polars |
| EDA | matplotlib, seaborn, sweetviz, pandas-profiling |
| ML Frameworks | scikit-learn, xgboost, lightgbm, catboost, torch, tensorflow |
| Pipelines | scikit-learn, dagster, airflow, prefect, kedro |
| Configs | Hydra, OmegaConf, dotenv |
| Logging | loguru, mlflow, wandb |
| Serving | FastAPI, Flask, BentoML, TorchServe |
| Testing | pytest, mypy, black, ruff, pylint |

python3 -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

Data Mining Tools

| Use Case | Is Pandas Ideal? |
| --- | --- |
| Exploratory Data Analysis (EDA) | ✅ Best choice |
| Clean, tabular data in memory | ✅ Excellent |
| Feature engineering for ML | ✅ Widely used |
| >100MB–1GB datasets | ⚠️ Still OK, but can slow down |
| >10GB datasets | ❌ Use Dask or Polars instead |

import dask.dataframe as dd

df = dd.read_csv("big_file.csv")    # lazy: builds a task graph, reads nothing yet
df.groupby("col").mean().compute()  # .compute() triggers the parallel execution
When to Use Dask

Spark:

When to Use Apache Spark

Many companies prototype in Dask, then move pipelines to Spark (or Databricks) for production-scale processing — especially if streaming or tight integration with data lakes is needed.

The moment your data is stored in the cloud, you usually want to move away from Dask/Spark clusters you manage directly and use cloud-native, serverless, or managed alternatives. Here's a breakdown of cloud-native tools that replace Dask, Spark, and even Pandas workflows, depending on the cloud provider:

| Workflow Type | Dask/Spark Equivalent | Cloud Tools (per provider) |
| --- | --- | --- |
| Batch Data Processing (ETL) | Spark, Dask | AWS Glue (serverless Spark); Google Dataflow (Apache Beam); Azure Data Factory (ADF) |
| Interactive Queries (SQL) | Spark SQL, DuckDB, Dask DataFrames | Amazon Athena (serverless SQL on S3); BigQuery (GCP); Azure Synapse SQL Serverless |
| Large-Scale ML Pipelines | Dask-ML, Spark MLlib | SageMaker Pipelines (AWS); Vertex AI Pipelines (GCP); Azure ML Pipelines |
| DataFrame-like Querying | Pandas, Dask | Snowflake + Snowpark for Python; BigQuery with pandas-gbq or dbt |
| Orchestrated Workflows (DAGs) | Airflow, Dask Scheduler | AWS Step Functions; Cloud Composer (Airflow on GCP); Azure Data Factory Pipelines |
| Streaming / Real-time | Spark Structured Streaming | Kinesis Data Analytics (AWS); Dataflow + Pub/Sub (GCP); Azure Stream Analytics |
| Parquet/Arrow file I/O | Dask, PyArrow | All clouds use Arrow + Parquet under the hood (via Athena, BigQuery, Snowflake, etc.) |

Cloud-by-Cloud Breakdown

AWS

| Task | Tool |
| --- | --- |
| ETL Pipelines | AWS Glue (Spark), AWS Data Wrangler |
| SQL on S3 | Amazon Athena |
| ML Pipelines | SageMaker Pipelines |
| Workflow Orchestration | AWS Step Functions + EventBridge |
| Serverless Python Queries | AWS Data Wrangler + Pandas |

GCP

| Task | Tool |
| --- | --- |
| ETL Pipelines | Dataflow (Apache Beam) |
| SQL on GCS | BigQuery |
| ML Pipelines | Vertex AI Pipelines |
| Python/SQL Analysis | BigQuery + pandas-gbq / Colab |

Azure

| Task | Tool |
| --- | --- |
| ETL Pipelines | Data Factory (ADF) |
| SQL on Blob Storage | Synapse Serverless SQL |
| ML Pipelines | Azure ML Pipelines |
| Streaming | Azure Stream Analytics |

Data Pipeline Setup

Goal: Ingest raw data → validate → clean → store processed version
Tools & Steps:

EDA and Feature Selection

Exploratory Data Analysis (EDA) is a critical first step in any data science or ML project. It helps you understand the structure, patterns, and anomalies in your data before modeling. Below is a structured set of EDA steps along with the most useful tools for each stage.

  1. Data Collection & Loading

Create a Python function to fetch the data and load it into an EDA framework like Pandas: calling fetch_housing_data() creates a datasets directory in your workspace, downloads the compressed file, and extracts data.csv from it into that directory.
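A minimal sketch of such a helper, assuming the data ships as a .tgz archive containing data.csv; the URL, directory name, and file names are placeholders to adapt to your data source:

```python
import os
import tarfile
import urllib.request

import pandas as pd

DATASETS_DIR = "datasets"  # local cache directory (assumption)

def fetch_housing_data(url, datasets_dir=DATASETS_DIR):
    """Download a .tgz archive and extract its contents into datasets_dir."""
    os.makedirs(datasets_dir, exist_ok=True)
    tgz_path = os.path.join(datasets_dir, "data.tgz")
    urllib.request.urlretrieve(url, tgz_path)
    with tarfile.open(tgz_path) as tgz:
        tgz.extractall(path=datasets_dir)

def load_data(datasets_dir=DATASETS_DIR):
    """Load the extracted CSV into a pandas DataFrame for EDA."""
    return pd.read_csv(os.path.join(datasets_dir, "data.csv"))
```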

  1. Initial Overview
  1. Data Types and Missing Values
  1. Univariate Analysis
  1. Bivariate & Multivariate Analysis
  1. Target Variable Analysis
  1. Outlier Detection
  1. Handling missing data:

    • drop them, or set their value (to 0, the mean, the median, or an imputed value)
    • Handling text and categorical attributes: ordinal encoders, one-hot encoding. One-hot encoding yields a sparse matrix, which avoids wasting memory on categorical attributes with thousands of categories, but training still slows down when the number of possible categories is very large; in that case, group rare categories together, or replace each category with a learnable low-dimensional vector called an embedding.
  2. Split Data into Train-Val-Test:
    Split data before applying any transformation that depends on data values:

    • Use stratified sampling to avoid significant sampling bias : the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. Suppose you chatted with experts who told you that the median income is a very important attribute to predict median housing prices. You may want to ensure that the test set is representative of the various categories of incomes in the whole dataset. Since the median income is a continuous numerical attribute, you first need to create an income category attribute, for example by binning. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum’s importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. Now you are ready to do stratified sampling based on the income category.
    • Fix random_state=42 for reproducibility.
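A sketch of income-category stratified splitting; the column names follow the housing example, and the toy data below stands in for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the housing data (median_income is the key attribute)
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.gamma(shape=2.0, scale=1.9, size=1000),
    "median_house_value": rng.uniform(50_000, 500_000, size=1000),
})

# Bin the continuous income into a small number of strata
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratified split: the test set mirrors the income-category proportions
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)
```

After the split you would typically drop the income_cat helper column before training.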
  3. Data Quality Checks

  1. Dimensionality Reduction (maybe - only if appropriate)
  1. Build a Full Preprocessing + Model Pipeline
    Write your own Custom Transformers for tasks such as custom cleanup operations or combining specific attributes. To work seamlessly with Scikit-Learn functionalities (such as pipelines), create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kwargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. These transformers are needed for preprocessing steps such as feature scaling (min-max scaling and standardization), outlier handling, or any other transformation of the data. The pipeline then exposes the same methods as its final estimator. Here is an example of a pipeline using custom transformers to perform several data processing steps.

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.ensemble import RandomForestClassifier
    
    # Split feature types
    numeric_cols = X.select_dtypes(include="number").columns
    cat_cols = X.select_dtypes(include="object").columns
    
    # Preprocessing pipelines
    numeric_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])
    
    cat_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Combine column-wise
    preprocessor = ColumnTransformer([
        ('num', numeric_pipeline, numeric_cols),
        ('cat', cat_pipeline, cat_cols)
    ])
    
    

    Create and use a custom transformer if needed - for example, for dropping a column:

    # Use a custom transformer inside the pipeline to drop a column cleanly
    from sklearn.base import BaseEstimator, TransformerMixin
    
    class ColumnDropper(BaseEstimator, TransformerMixin):
        def __init__(self, columns_to_drop=None):
            self.columns_to_drop = columns_to_drop or []
    
        def fit(self, X, y=None):
            return self
    
        def transform(self, X):
            return X.drop(columns=self.columns_to_drop)
    
    
    drop_cols = ["id", "duplicate_flag"]  # columns you don't want passed to model
    
    dropper = ColumnDropper(columns_to_drop=drop_cols)
    

    Here is another production-safe version of an outlier removal transformer that drops rows during training, but is designed to not drop any rows during inference. This respects the real-world constraint that you usually cannot drop incoming data at inference time.

    • During .fit() and .transform() on training data, outliers are removed (rows dropped).
    • During .transform() on test or inference data, rows are left untouched (you’ll typically log or monitor them, not drop).
    import numpy as np
    import pandas as pd

    class OutlierRemover(BaseEstimator, TransformerMixin):
        def __init__(self, method='iqr', factor=1.5, z_thresh=3.0, apply_to='numeric'):
            self.method = method        # 'iqr' or 'zscore'
            self.factor = factor        # IQR factor
            self.z_thresh = z_thresh    # Z-score threshold
            self.apply_to = apply_to    # 'numeric' or list of column names
            self.columns_ = None
            self.stats_ = {}

        def fit(self, X, y=None):
            X = pd.DataFrame(X)
            if self.apply_to == 'numeric':
                self.columns_ = X.select_dtypes(include='number').columns
            else:
                self.columns_ = self.apply_to

            if self.method == 'iqr':
                Q1 = X[self.columns_].quantile(0.25)
                Q3 = X[self.columns_].quantile(0.75)
                IQR = Q3 - Q1
                self.stats_['lower'] = Q1 - self.factor * IQR
                self.stats_['upper'] = Q3 + self.factor * IQR
            elif self.method == 'zscore':
                self.stats_['mean'] = X[self.columns_].mean()
                self.stats_['std'] = X[self.columns_].std()
            else:
                raise ValueError("Method must be 'iqr' or 'zscore'")
            return self

        def transform(self, X, y=None):
            X = pd.DataFrame(X)
            if y is None:
                # ⚠️ At inference, do not drop rows — just return unchanged data
                return X.reset_index(drop=True)

            if self.method == 'iqr':
                mask = ((X[self.columns_] >= self.stats_['lower']) &
                        (X[self.columns_] <= self.stats_['upper'])).all(axis=1)
            else:  # zscore
                z_scores = (X[self.columns_] - self.stats_['mean']) / self.stats_['std']
                mask = (np.abs(z_scores) < self.z_thresh).all(axis=1)
            return X[mask].reset_index(drop=True), y[mask].reset_index(drop=True)

        def fit_transform(self, X, y=None):
            # TransformerMixin.fit_transform would not forward y to transform(),
            # so override it here to keep X and y aligned during training
            return self.fit(X, y).transform(X, y)
    

    It is important to know that scikit-learn pipelines are not designed to handle changes in the number of samples (rows) between steps, such as dropping rows. They assume:

    • The number of samples (rows) in X and y stays the same
    • Each step transforms features (columns) only, not row counts

    Best Practice: Handle Row-Dropping Outside the Pipeline

        # Step 1: Clean training data (row-dropping happens outside the pipeline)
        remover = OutlierRemover(method='iqr', factor=1.5)
        X_train_clean, y_train_clean = remover.fit_transform(X_train, y_train)

        # Step 2: Create the pipeline with all non-destructive steps
        pipeline = Pipeline([
            ('dropper', dropper),                       # ✅ Drop irrelevant cols
            ('outlier_capper', IQRCapper(factor=1.5)),  # non-destructive capping (custom transformer, defined similarly)
            ('preprocessing', preprocessor),            # numeric + cat
            ('model', RandomForestClassifier())         # model
        ])
    

    The pipeline ensures that preprocessing is fit only on the training folds and applied properly on val/test folds during CV.

    # Fit Only on Training Data
    # Now it's safe to use cross-validation
    # Either
    cross_val_score(pipeline, X_train_clean, y_train_clean, cv=5)
    
    # Or -
    # full_pipeline.fit(X_train_clean, y_train_clean)
    

    Predict on Raw Test Data

    # Step 3: Use .transform() at inference (rows not dropped)
    X_test_transformed = remover.transform(X_test)
    preds = pipeline.predict(X_test_transformed)
    

    Or, serialize the pipeline for deployment

    # Serialize both objects separately for Deployment 
    from joblib import dump
    dump(remover, "outlier_remover.joblib")
    dump(pipeline, "model_pipeline.joblib")
    

    At inference:

    from joblib import load
    
    # Load both parts
    remover = load("outlier_remover.joblib")
    pipeline = load("model_pipeline.joblib")
    
    # Raw new input (e.g. from user or API)
    new_data = pd.DataFrame({...})
    
    # Apply same outlier logic (no row dropping!)
    cleaned_data = remover.transform(new_data)
    
    # Predict
    predictions = pipeline.predict(cleaned_data)
    

    Optionally you could wrap both parts remover and pipeline into one piece like this:

    class FullModelWithOutlierHandling:
        def __init__(self, remover, pipeline):
            self.remover = remover
            self.pipeline = pipeline
    
        def fit(self, X, y):
            X_clean, y_clean = self.remover.fit_transform(X, y)
            self.pipeline.fit(X_clean, y_clean)
            return self
    
        def predict(self, X):
            X_transformed = self.remover.transform(X)
            return self.pipeline.predict(X_transformed)
    
        def save(self, path_prefix="model"):
            from joblib import dump
            dump(self.remover, f"{path_prefix}_remover.joblib")
            dump(self.pipeline, f"{path_prefix}_pipeline.joblib")
    
        @classmethod
        def load(cls, path_prefix="model"):
            from joblib import load
            remover = load(f"{path_prefix}_remover.joblib")
            pipeline = load(f"{path_prefix}_pipeline.joblib")
            return cls(remover, pipeline)
    

    and use it like this:

    # Training
    model = FullModelWithOutlierHandling(remover, pipeline)
    model.fit(X_train, y_train)
    model.save("rf_model")
    
    # Inference
    model = FullModelWithOutlierHandling.load("rf_model")
    preds = model.predict(new_data)
    

    Benefits of This Design

    • ✅ Prevents leakage and keeps logic modular
    • ✅ Easy to save/load entire system
    • ✅ Clean API: just .fit() and .predict()
    • ✅ Fully compatible with joblib, MLflow, or FastAPI deployment
    • ✅ Transparent and testable

    Serialized pipelines are reusable, versionable, deployable, and production-safe. Copying code is error-prone, inconsistent, and not scalable. Imagine your training had StandardScaler() but your inference script forgot it — predictions will be totally wrong.

    Benefits of Serializing (joblib.dump, pickle, torch.save, etc.):

    | Why It Matters | What It Solves |
    | --- | --- |
    | 🛠 Consistency | No need to re-run preprocessing manually in production |
    | 🕰 Time-saving | Avoid retraining or rewriting code to get the same result |
    | 📦 Deployment-ready | Easily load the pipeline in a web service (e.g. FastAPI, Flask) |
    | 💾 Versioning | Save multiple models/pipelines with known behavior |
    | Integration | Works well with MLflow, BentoML, SageMaker, etc. |
    | 🔄 Reproducibility | Same output every time from the same serialized pipeline |

    For pipelines involving deep learning or LLMs, use PyTorch, TensorFlow, or Hugging Face tools:

    | Framework | Equivalent to Pipeline | Purpose |
    | --- | --- | --- |
    | PyTorch | nn.Sequential, custom classes | Compose neural nets, transformations |
    | PyTorch Lightning | LightningModule + DataModule | Structured, modular deep learning training |
    | TensorFlow | tf.keras.Sequential, tf.data.Dataset | Model + input pipeline |
    | Hugging Face | Trainer, Pipeline, Transformers | Full stack for training/inference of LLMs |
    | FastAI | Learner, DataBlock | High-level abstraction for PyTorch |
    | BentoML / MLflow | Model serving w/ pre/post logic | Deployment of DL/LLMs with preprocessing |

    How it maps:

    | Stage | scikit-learn | Deep Learning Equivalent |
    | --- | --- | --- |
    | Preprocessing | Pipeline | torchvision.transforms, tf.data, datasets.Dataset.map() |
    | Model definition | estimator | nn.Module, Keras model, HF AutoModel |
    | Fitting | .fit() | .fit() / Trainer.train() |
    | Inference | .predict() | pipeline() (Hugging Face), .forward() |
    | Deployment | joblib.dump | torch.save(), BentoML.save_model() |
  2. Visualization Dashboard (Optional)

For large data or cloud-stored data, use BigQuery, or Athena with SQL to do EDA in-place instead of loading it all into memory.

Model Training, Selection and Evaluation

We want to explore data preparation options, try out multiple models, shortlist the best ones, and fine-tune their hyperparameters using GridSearchCV, automating as much as possible. At this stage, we have a playground to try multiple model types, take care of overfitting/underfitting, and tune parameters (use grid search or randomized search when the hyperparameter search space is large) to choose the best model. Use K-fold cross-validation if it is not too costly to train the model several times. You can easily save Scikit-Learn models using Python's pickle module, or using the joblib library (the old sklearn.externals.joblib import is deprecated), which is more efficient at serializing large NumPy arrays. Also, you may want to use ensemble models to improve prediction performance.
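A compact sketch of this stage, using synthetic data and an illustrative parameter grid (real searches would also cover preprocessing options):

```python
import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=42)

# Small illustrative grid; each combination is evaluated with 3-fold CV
param_grid = {"n_estimators": [10, 30], "max_depth": [4, 8]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

# Persist the winning model with joblib (efficient for large NumPy arrays)
joblib.dump(grid_search.best_estimator_, "best_model.joblib")
model = joblib.load("best_model.joblib")
```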

Evaluate Your System on the Test Set:

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set. There is nothing special about this process; just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform(), you do not want to fit the test set!), and evaluate the final model on the test set:

final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse) # => evaluates to 47,730.2

Such a point estimate of the generalization error will not be quite enough to convince you to launch: what if it is just 0.1% better than the model currently in production? To have an idea of how precise this estimate is, you can compute a 95% confidence interval for the generalization error using scipy.stats.t.interval():

from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
loc=squared_errors.mean(),
scale=stats.sem(squared_errors)))

array([45685.10470776, 49691.25001878])

Deployment-Ready Inference

Now comes the project prelaunch phase: you need to present your solution (highlighting what you have learned, what worked and what did not, what assumptions were made, and what your system’s limitations are), document everything, and create nice presentations with clear visualizations and easy-to-remember statements (e.g., “the median income is the number one predictor of housing prices”). The final performance of the system is not better than the experts’, but it may still be a good idea to launch it, especially if this frees up some time for the experts so they can work on more interesting and productive tasks.

Use FastAPI or Flask to create an API server:

from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load("model.pkl")
preprocessor = joblib.load("preprocessor.pkl")

class InputData(BaseModel):
    # Declare your input schema here; these field names are only examples
    median_income: float
    housing_median_age: float

@app.post("/predict")
def predict(data: InputData):
    df = pd.DataFrame([data.dict()])
    X = preprocessor.transform(df)
    pred = model.predict(X)
    return {"prediction": pred.tolist()}

Launch, Monitor, and Maintain Your System

Perfect, you got approval to launch! You need to

Best Practices for Production

A production-optimized ML workspace should:

Full Life Cycle of MLOps Pipeline

Below is the full life cycle of a typical ML pipeline, from raw data to live, monitored model, presented the way companies actually build it in production.

  1. Problem Definition & Requirements
    Before touching code:
  1. Data Ingestion
    Bring in data from its source(s):
  1. Data Preprocessing
    Clean and prepare for modeling:
  1. Feature Engineering
    Enhance model signal:
  1. Model Training
    Core ML step:
  1. Model Evaluation
    Check performance before deployment:
  1. Packaging & Deployment
    Make it available for use:
  1. Monitoring & Maintenance
    Keep it healthy after release:
  1. Continuous Improvement
    Pipeline is never truly done:

Typical ML pipeline:

flowchart TD
    A[Business Problem] --> B[Data Ingestion]
    B --> C[Data Lake/Feature Store]
    C --> D[Preprocessing]
    D --> E[Feature Engineering]
    E --> F[Model Training & Evaluation]
    F --> G[Model Registry]
    G --> H[Deployment: Batch/Real-Time]
    H --> I[Monitoring & Feedback Loop]
    I --> B
    linkStyle 0,1,2,3,4,5,6,7,8 stroke: blue;

We now walk through this life cycle for a concrete example project.

Problem Definition & Requirements
Our case: Fraud detection

Data Ingestion:

Data Preprocessing:

Feature Engineering:

Model Training:

Model Evaluation:

Packaging & Deployment:

Monitoring & Maintenance:

Continuous Improvement:

Full ML pipeline diagram for the project

             ┌─────────────────────────┐
             │ 1. Problem Definition   │
             │ - Goal, metrics, SLA    │
             │ - Stakeholder alignment │
             └───────────┬─────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────┐
│ 2. Data Ingestion (Best practice: version everything)    │
│ - Source: S3 bucket / Data lake                          │
│ - Tool: boto3 or AWS CLI, Airflow DAG                    │
│ - Script: data_ingestion.py (save_data_local)            │
│ - Store raw CSV + Parquet in data/raw/                   │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────────────────────────┐
│ 3. Data Preprocessing (Best practice: same logic in train & inference) │
│ - Tool: Pandas / PySpark                                               │
│ - Script: preprocessing.py                                             │
│ - Save output in data/processed/                                       │
│ - Package transformations in sklearn.Pipeline                          │
│ - Versioned with DVC or stored in Feature Store                        │
└─────────────────────────┬──────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 4. Feature Engineering (Best practice: store features in Feature Store) │
│ - Tool: Pandas, scikit-learn, Feature Store (Feast)                     │
│ - Script: feature_engineering.py                                        │
│ - Examples: rolling stats, frequency counts, embeddings                 │
│ - Store reusable features for multiple models                           │
└─────────────────────────┬───────────────────────────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────────────────────────────┐
│ 5. Model Training (Best practice: reproducibility & tracking)             │
│ - Tool: scikit-learn, XGBoost, LightGBM, PyTorch                          │
│ - Experiment tracking: MLflow / Weights & Biases                          │
│ - Script: train.py                                                        │
│ - Config-driven parameters (config.yaml)                                  │
│ - Save model artifacts to models/ and register in MLflow Model Registry   │
└─────────────────────────┬─────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────┐
│ 6. Model Evaluation (Best practice: report for stakeholders)    │
│ - Tool: scikit-learn metrics, Matplotlib, Seaborn               │
│ - Script: evaluate.py                                           │
│ - Output: metrics_{version_tag}.json + HTML report in reports/  │
│ - Sign-off before deployment                                    │
└─────────────────────────┬───────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 7. Packaging & Deployment (Best practice: CI/CD automated deployment)   │
│ - Format: MLflow model / joblib / ONNX                                  │
│ - Real-time: FastAPI → Docker → AWS ECS / SageMaker                     │
│ - Batch: batch_inference.py → scheduled via Airflow                     │
│ - CI/CD: GitHub Actions / GitLab CI for build & deploy                  │
└─────────────────────────┬───────────────────────────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────────────────────────────┐
│ 8. Monitoring & Maintenance (Best practice: alerting on drift & latency)  │
│ - Tools: Evidently AI (data drift), Prometheus + Grafana (latency, uptime)│
│ - Log predictions + actuals to monitoring DB                              │
│ - Alerts via Slack / PagerDuty                                            │
└─────────────────────────┬─────────────────────────────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────────┐
│ 9. Continuous Improvement                             │
│ - Retraining triggers: schedule or drift detection    │
│ - A/B testing new models                              │
│ - Incremental feature additions                       │
└───────────────────────────────────────────────────────┘

Key Best Practice Highlights

Airflow DAGs

Airflow DAGs are useful in an ML pipeline because they solve a very practical problem: getting all the steps of your pipeline to run automatically, in the right order, at the right time, and with the right dependencies tracked.

Here’s why teams use Airflow instead of ad-hoc scripts:

  1. Scheduling & Automation
  1. Dependency Management
  1. Versioned & Reproducible Workflows
  1. Scalability
    Airflow can run steps on different machines or containers, not just your laptop. Example:

    • Data preprocessing runs on a Spark cluster.
    • Model training runs on a GPU node.
    • Deployment step pushes to AWS SageMaker.
  2. Monitoring & Alerts

In the development phase, it's often better to:

Once your preprocessing, training, and evaluation scripts stop changing daily,

DVC’s Role in ML Pipelines

DVC (Data Version Control) is not a replacement for Airflow, MLflow, or your scripts. It’s a data & artifact versioning system with reproducibility baked in. Think of it as Git for data and models.

  1. Versioning Data
    Tracks raw data, processed data, feature sets. Example:

    dvc add data/raw/fraud_data_2025-08-13.csv
    dvc push
    

    Ensures you can reproduce any experiment with the exact same dataset.

  2. Versioning Features & Models
    After preprocessing or feature engineering:

    dvc add data/processed/features_v1.parquet
    dvc add models/fraud_model_v1.pkl
    dvc push
    

    DVC tracks changes in features/models, so your Airflow DAG can always pull the right version for training or inference.

  3. Linking Pipeline Stages
    DVC can define stages with dependencies: raw data → preprocessing → features → training → evaluation. Each stage:

    • Knows its inputs and outputs
    • Can re-run only if inputs change

    Example:

    stages:
      preprocess:
        cmd: python src/preprocess.py
        deps:
          - data/raw/fraud_data.csv
        outs:
          - data/processed/features.parquet
      train:
        cmd: python src/train.py
        deps:
          - data/processed/features.parquet
        outs:
          - models/fraud_model.pkl
    
  4. Interaction with Airflow

  1. Interaction with MLflow

✅ In short:

How to Use DVC

DVC starts locally, but its real power comes from remote storage integration. Let me clarify:

  1. Local tracking
  1. Remote storage
    You can configure DVC to use S3, GCS, Azure Blob, SSH, or even shared network drives as a remote storage.

    dvc remote add -d myremote s3://mybucket/ml-data
    dvc push
    dvc pull
    

    Now your data, features, and models are centralized and accessible to other team members or servers.

  2. Pipeline reproducibility
    Even if the pipeline runs on another machine or server, DVC can pull the exact same dataset and features for training.

✅ In short:

DVC isn’t just for raw data. Best practices:

| Category | Example | Notes |
| --- | --- | --- |
| Raw data | data/raw/... | Immutable, pulled from S3, external sources, or dumps |
| Processed / feature data | data/processed/features.parquet | Intermediate outputs that are expensive to compute, especially for large datasets |
| Trained models | models/model.pkl, models/xgboost/ | Any model artifacts you want to version for reproducibility |
| Metrics / reports | reports/metrics.json, reports/figures/ | Optional but helpful to track experiments |
| Large experiment artifacts | embeddings, vector stores, checkpoints | Anything too big for Git but needed to reproduce results |

Rule of thumb: anything big, expensive to compute, or non-deterministic should go through DVC.

Let’s walk through a realistic DVC story with S3 using our fraud detection pipeline, step by step, and show how dvc.yaml orchestrates it.

Scenario

You have:

Step 1: Initialize DVC

git init
dvc init
git add .dvc .dvcignore .gitignore 
git commit -m "Init DVC for fraud pipeline"

Step 2: Configure remote storage (S3)

dvc remote add -d s3remote s3://fraud-data-bucket/dvc-storage
dvc remote modify s3remote access_key_id <YOUR_KEY>
dvc remote modify s3remote secret_access_key <YOUR_SECRET>

Step 3: Track raw data with DVC

# local file you just downloaded
dvc add data/raw/fraud_2025-08-13.csv
git add data/raw/fraud_2025-08-13.csv.dvc .gitignore
git commit -m "Track raw fraud dataset with DVC"
dvc push # sync to S3 remote

DVC stores the file hash and keeps a pointer in Git. If the file content changes, DVC knows automatically. In our project, raw data was tracked as a stage output (from ingest) rather than via manual dvc add. Both approaches work: if a file is produced by a script, define it as an out in dvc.yaml; if it's a one-off dataset you downloaded, use dvc add.

Step 4: Define DVC pipeline (dvc.yaml) - (file-level DAG)

DVC enables automatic reproducibility: dvc repro reruns only what changed.

stages:
  preprocess:
    cmd: python src/preprocess.py --input data/raw/fraud_2025-08-13.csv --output data/processed/features.parquet
    deps:
      - src/preprocess.py
      - data/raw/fraud_2025-08-13.csv
    outs:
      - data/processed/features.parquet

  train:
    cmd: python src/train.py --input data/processed/features.parquet --output models/fraud_model.pkl
    deps:
      - src/train.py
      - data/processed/features.parquet
    outs:
      - models/fraud_model.pkl
    metrics:
      - reports/train_metrics.json
Commit the pipeline definition to Git:

git add dvc.yaml dvc.lock params.yaml
git commit -m "pipeline + params"

Step 5: Run the pipeline

dvc repro  # runs only needed stages (hash-based)

In production, Airflow DAGs call these same scripts. DVC ensures exact versions of inputs/outputs, while Airflow handles scheduling and orchestration.

DVC automatically checks hashes of deps:

Step 6: Push artifacts to S3

dvc push  # pushes data to S3 remote
git push  # pushes code+pointers (no big files in Git)
dvc pull
dvc repro

They get the exact same data + features + model, fully reproducible.

Step 7: Track experiments

After setting this up:

When you manually tag your data (e.g., fraud_2025-08-13.csv), you control the version, but that version is not tied to the content of the file. DVC versioning adds automatic reproducibility:

To quickly check to see what DVC thinks is tracked:

dvc list .
dvc status -c

What does dvc push do?

Later, when someone runs dvc pull:

✅ In short:
dvc push takes the local cached data (already added with dvc add) and syncs it to your remote storage so you and your teammates can later dvc pull it anywhere.

What are deps and outs?

What Happens Where
Track inputs (for re-run logic) deps:
Track outputs (for versioning) outs:
Store hashes & timestamps dvc.lock
Cache outputs (reproduce later) DVC internal cache

DVC will handle syncing with remote storage, but it always works from local paths. DVC automatically tracks any outs of any stage in dvc.yaml, so there is no need to add them manually.

Concept What it does Notes
Manual tag You decide the filename/version Works, but DVC hash adds reproducibility
DVC dep Input file/script that triggers stage rerun Must be local path
DVC out Output file tracked by DVC cache Must be local path; can be pushed to S3 remote
dvc repro Rebuilds stages whose deps changed Uses hashes, not filenames

Key differences vs manual version tags

Feature Manual tags DVC hash workflow
Track raw data Filename only SHA256 hash, exact content tracked
Track processed features Filename + manual tag DVC manages caching & reruns only if input changes
Trigger reruns automatically ❌ Manual dvc repro handles dependencies automatically
Reproducibility Manual ✅ Guaranteed (hash + remote storage)
Team sharing Manual sync dvc push + dvc pull
Remote storage support Manual copy/S3 ✅ DVC handles sync to S3 automatically

In short: DVC replaces manual bookkeeping of versions with hash-based reproducibility and automated reruns. Your version tags can still exist as metadata, but DVC ensures you never accidentally rerun the wrong pipeline or lose a version.

Controlling what you push

dvc status -c shows which outputs are in your remote vs your local cache, so you know what will actually be pushed or pulled.

Git vs DVC

Git does hashing and pointers, but it’s not built for large data and ML pipelines. Here’s why DVC is necessary compared to git:

Feature Git DVC
File size <100MB ideally Any size (GBs, TBs)
Storage Repo grows with data Data stored in remote/cache, Git stores only pointers
Versioning large binaries Inefficient Efficient (hash + remote storage)
Reproducible pipelines ❌ ✅ dvc.yaml stages, deps/outs
Partial rerun of pipeline ❌ ✅ Only stages with changed deps rerun
  1. Git is for code, DVC is for data + experiments
  2. Pipelines
  3. Remote storage & team collaboration

✅ In short:

Best practices for version control

dvc exp run
dvc exp show
dvc exp apply <id>

This helps track metrics and versions without creating permanent Git commits immediately.

Best practices for DVC+Airflow

  1. Use DVC for reproducibility, Airflow for orchestration
  2. Always dvc pull at the start of the DAG
  3. Use dvc repro for specific stages
  4. Push outputs at the end
                   +-----------------+
                   |   S3 Raw Data   |  <-- Remote storage (DVC)
                   +-----------------+
                             |
                             |  dvc pull
                             v
                   +-----------------+
                   |   Local Cache   |  <-- .dvc/cache stores hashes
                   +-----------------+
                             |
                             v
                   +-----------------+
                   | Preprocessing   |  <-- DVC stage
                   |  (pipeline.pkl) |
                   +-----------------+
                             |
                             v
                   +-----------------+
                   | Feature Eng.    |  <-- DVC stage
                   | (features.parquet)
                   +-----------------+
                             |
                             v
                   +-----------------+
                   | Model Training  |  <-- DVC stage
                   | (model.pkl)     |
                   +-----------------+
                             |
                             v
                   +-----------------+
                   | Metrics / Eval  |  <-- optional DVC stage
                   +-----------------+
                             |
                             v
                   +-----------------+
                   | DVC Push        |  <-- Upload processed data, features, models
                   +-----------------+
                             |
                             v
                   +-----------------+
                   | FastAPI Deploy  |  <-- Serve latest model
                   +-----------------+

How DVC + Airflow works

Airflow DAG tasks call:

DVC caching

Versioning

✅ Takeaways

Feature Benefit
Fully reproducible pipeline ✅ Any version_tag can be restored
Efficient re-runs ✅ Only runs when deps change
Works with Airflow or manually ✅ Trigger dvc repro anywhere

How versioning works in this DAG

Other team members can pull the same versions with dvc pull.

Airflow for ML pipelines

We use Airflow to run the main parts of the full-cycle ML pipeline:

DVC pipeline already explained. We use

Initialize Airflow and Create a DAG

Use the official Airflow docker compose YAML file to run Airflow, and add your own Dockerfile to customize the image, for example to install extra packages or set env variables. The Airflow docker compose file configures and runs the backend databases (Redis, Postgres), the Airflow Scheduler, Airflow Worker, and the Airflow Webserver at http://localhost:8080. Airflow won't show DAGs whose .py files contain syntax errors.

Airflow creates folders for its operation, such as dags/, where we put our DAGs for each pipeline, such as dvc_dag.py. This is a DVC-versioned pipeline that controls data flow through the processing and model-training stages, which also log/register ML pipelines and models. This DAG meets the following objectives:

Deep MLOps Pipeline (Full ML Lifecycle)

In this project we build a fraud detection pipeline with Airflow, DVC and MLflow along with inference server and some monitoring and observability best practices.

Fraud Detection

Data Schema

Column Name Type Notes
transaction_id int Unique ID
amount float Outliers here
transaction_time float Seconds since account open
transaction_type categorical e.g., “online”, “in-person”
location_region categorical e.g., “US-West”, “EU”
is_fraud binary (0/1) Target — imbalanced

Feature Example

Feature Type Example Features
Numeric Transaction Amount, Time Delta
Categorical Transaction Type, Region
Derived Amount/Time ratio, Z-score outlier
Target (binary) Fraud (1) vs Legit (0)

We use simulated data to train models for fraud detection. I chose this task because it:

Component Decision
Data Domain Fraud Detection
Data Ingestion CSV, simulate imbalanced + outliers
Data Versioning DVC + structured filenames + metadata
Monitoring Use Case Confidence, drift, outliers, latency

ML Pipeline: DVC + Airflow

The ML pipeline covers every step from the availability of raw data up to trained models ready for deployment. It is a scalable ETL + preprocessing + training pipeline with versioned data and models, orchestrated with Airflow for maximum flexibility.

Component Description Tool/Option
Data Preprocessing Save preprocessing params/stats/pipelines sklearn + pickle
Model Versioning Experiement/save models with parametes, inputs MLflow / S3
Data Versioning Track datasets/artifacts used in ML pipeline DVC / manual logging
Pipeline Orchestration Automate full flow Airflow DAGs
Artifact Tracking Logs, models, metrics tracked MLflow / S3
Train Trigger DAG or API starts training on demand Airflow trigger or FastAPI POST

Airflow DAG — Automate Entire Lifecycle

Stage Operator Description
Raw Data Ingestion BashOperator Run ETL with python etl_task.py
Preprocess + Version BashOperator dvc repro preprocess
Train + Version BashOperator dvc repro train
Notify/Log PythonOperator Slack or log output

This pipeline pulls versioned data tagged (e.g., v20250817_175136) from S3 and saves it locally at data/raw. (The version tag here identifies the sample of real data the model is built from. This tag may not be necessary, because DVC's automatic versioning replaces manual versioning.)

As data/raw is the output of a DVC stage in dvc.yaml, it is tracked by DVC automatically; no need to manually dvc add it. The ETL task simulates a data load (e.g., a CSV from data_source/, or generated synthetic tabular data), cleans nulls, formats columns, and saves to data/raw/*.csv. The preprocessing stage loads this data as its dependency (deps) and fits a sklearn preprocessing pipeline (scaling, e.g., StandardScaler; imputation; encoding; feature engineering), which is saved and tracked at artifacts/preprocess. Next, we have two models to train: an Outlier Detector and a Fraud Detector. DVC stages train_outlier and train_model run the training logic for each task using the raw data in data/raw passed through the preprocessing pipeline. Models, their parameters, metrics, sample inputs, and related tags are logged, versioned, and registered in the MLflow server. Model artifacts are also saved and tracked by DVC at artifacts/models.
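The two training stages described here might look like the following in dvc.yaml (the script names and artifact paths are illustrative assumptions based on the text, not the project's actual file):

```yaml
stages:
  train_outlier:
    cmd: python src/train_outlier.py
    deps:
      - src/train_outlier.py
      - data/raw
      - artifacts/preprocess
    outs:
      - artifacts/models/outlier_detector.pkl
  train_model:
    cmd: python src/train_model.py
    deps:
      - src/train_model.py
      - data/raw
      - artifacts/preprocess
    outs:
      - artifacts/models/fraud_model.pkl
```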

All the stages (inputs and outputs) in this pipeline are version-controlled by DVC, so they only rerun if previous stages changed. At every stage, versions of the outputs (outs) are cached and pushed to the remote for reproducibility, so anyone can pull versions and reproduce the pipeline quickly. Model tags explicitly contain information (the Git commit hash) about the data version or the preprocessor version used to train the model, so it is easy to check out that specific version and exactly reproduce the pipeline that trained that particular version of the models.

Now a teammate can reproduce the versioned pipeline, without needing the original data at all, by cloning this Git repo:

git clone <this_repo>
cd <this_repo>
dvc pull 

After this:

That's it! No need for the original CSV data file or .pkl artifacts to be present. That's exactly where DVC shines. If only the data is needed, run dvc pull data/raw. If only a particular version of the preprocessing pipeline is needed, run:

git checkout <commit-or-tag>  # pick the corresponding commit with the version
dvc pull  # downloads the exact deps/outs from remote
dvc repro preprocess  # reruns the stage if the code has changed

DVC remote storage is configured (S3, MinIO, GCS, etc.) so any output you choose to track is backed up in the cloud — but not cluttering your laptop/disk.

Note that we didn't save the processed (clean) data.

Element Our Decision
Clean Data File Not saved — we avoid storing processed data
Preprocessing Pipeline ✅ Saved as artifact (pipeline_{tag}.pkl)
Training Data Source ✅ Apply pipeline on raw data again at training time
Result Minimal storage, full reproducibility, modular and scalable

Enforce Data Consistency

To ensure data is consistent at training and inference:

  1. Preprocess stage saves:
    • Feature names (features_final)
    • Scaler/encoder (e.g., pickle)
  2. Train script loads these
    • Enforces data format before training
    • Saves the artifacts again for inference reuse
  3. Inference step validates:
    • Incoming data columns == expected columns
    • Version match check via version_tag_meta.json
Step File/Artifact
Preprocess saves preprocess_metadata.json
Train loads + uses scaler, encoder, feature names
Inference uses same Load scaler/encoder + validate
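The validation step in item 3 can be sketched as follows. This is a minimal example, assuming the preprocess stage wrote a JSON file with a "features_final" key as described above; the function name and exact file format are our own illustration:

```python
import json

def validate_columns(incoming_cols, metadata_path="preprocess_metadata.json"):
    """Fail fast when inference input columns differ from the training columns."""
    with open(metadata_path) as f:
        expected = json.load(f)["features_final"]
    missing = set(expected) - set(incoming_cols)
    extra = set(incoming_cols) - set(expected)
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return True
```

Calling this at the top of the inference path turns silent schema drift into an explicit, loggable error.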

DVC+Airflow+MinIO

To set up a remote repo for DVC, you need to populate .dvc/config with the remote and the credentials to connect to it.

# .dvc/config
[core]
    remote = minio
['remote "minio"']
    url = s3://mlflow-artifacts
    endpointurl = http://minio:9000
    access_key_id = minioadmin
    secret_access_key = minioadmin
    use_ssl = false

You can do this by running the following commands:

#!/bin/bash
dvc remote add -f minio s3://mlflow-artifacts
dvc remote modify minio endpointurl http://minio:9000
dvc remote modify minio access_key_id minioadmin
dvc remote modify minio secret_access_key minioadmin
dvc remote modify minio use_ssl false
dvc remote default minio

You can clean a remote using dvc remote remove <previous-remote>. You can use mc or Terraform/Ansible to set bucket policy at startup:

mc alias set minio http://minio:9000 minioadmin minioadmin
mc mb --ignore-existing minio/mlflow-artifacts
mc anonymous set public minio/mlflow-artifacts

You can test from inside the actual worker container using:

aws --endpoint-url http://minio:9000 s3 ls s3://mlflow-artifacts
dvc push

Why use DVC with S3/MinIO remote?

What you shouldn't do with DVC + S3 remote

Don’t directly write pipeline outputs to S3 URLs in outs. DVC can’t cache remote outputs and can’t reproduce reliably. Always write outputs locally and let DVC push them to remote for versioning and storage.

It generally does not make sense to have deps or outs pointing directly to S3 (or MinIO) in your dvc.yaml. Why?

Typical Workflow with DVC, Airflow, MLflow & S3

  1. Data storage (S3 / MinIO)

    • Your raw data (e.g., logs, images, CSVs) is stored in a remote S3 bucket (MinIO).
    • This is your source of truth raw data, immutable and accessible from anywhere.
  2. Data ingestion / Extraction (Airflow)

    • Airflow orchestrates the entire ML pipeline as a workflow with multiple tasks.
    • First step: download or copy the raw data from S3 into a local workspace inside your Airflow worker (or an ephemeral container).
    • This can be a DVC pull operation to bring a specific version of data or a direct download via awscli or mc command.
  3. Preprocessing & Feature Engineering (DVC)

    • Now, you preprocess the raw data locally (cleaning, feature extraction, transforms).
    • DVC tracks this step as a pipeline stage:
    • The input: raw data files (local copies)
    • The command: preprocessing script (e.g. python preprocessing.py)
    • The output: processed data stored locally (e.g. data/processed/)
    • DVC tracks all input/output files by hashing content, so you know exactly which version of input produced which output.
  4. Model Training

    • Next pipeline stage: train your ML model using the processed data.
    • The output: trained model files locally (e.g. models/model.pkl).
    • DVC tracks the training stage, inputs, outputs.
    • You log metrics, parameters, artifacts, and model versions in MLflow.
    • MLflow acts as your model registry + experiment tracker.
  5. Push artifacts and data

    • After each stage, you push the generated artifacts and processed data to S3 (your remote DVC storage) with dvc push.
    • This lets all collaborators reproduce the pipeline and retrieve exact versions of data and models.
  6. Deployment & Monitoring

    • Once the model is trained and registered, you might have:
    • Airflow tasks to deploy the model (e.g. to SageMaker, KFServing).
    • Airflow jobs to monitor model performance, data drift.
    • MLflow holds the model versions & deployment metadata.
    • DVC ensures full reproducibility of data and models.
Step Tool(s) What it does
Data storage S3 / MinIO Store raw and processed data remotely
Pipeline orchestration Airflow Schedule and monitor pipeline stages
Data versioning DVC Track input/output files, pipeline stages
Model versioning MLflow Log metrics, register model versions
Execution Airflow triggers DVC cmds Run stages like preprocess, train, push data
Deployment & Monitoring Airflow + MLflow Deploy model, monitor, trigger retraining

Why not only DVC?


We don't save processed data here because it is just as easy to apply the saved preprocessing pipeline to the raw data at every stage, exactly as we do for inference later.


Model Registry: MLflow

A professional-grade model registry is used for model versioning, rollback, promotion, audit trails, and safe deployment. What is a model registry? A model registry is like Docker Hub for ML models:

MLflow Model Registry

Feature MLflow Registry Comparable in Software
Model Versioning Each model gets a version Like Docker tags: v1, v2
Promotion & Rollback Move to “Production” stage Like Git branches/tags
Storage Backend Local, S3, GCS, Azure Like Docker Hub or Artifactory
UI Dashboard Track models visually Like DockerHub Web UI
Integration Airflow, FastAPI, DVC, etc. Seamless in pipelines

MLflow vs DVC

Aspect MLflow DVC
Primary Purpose Model tracking & deployment Data + model versioning for development
Stores Artifacts Yes (models, metrics) Yes (models, datasets)
Experiment Tracking Built-in (metrics, params): Every training run auto-logged + versioned No (but can log separately)
Rollback Support Yes (model promotion): Easily deploy previous model version Manual checkout
UI Dashboard Yes (MLflow UI): Track runs, metrics, artifacts, and models via browser No UI for registry
Integration REST API, Python, Airflow Git, CLI
DVC + MLflow Co-exist: MLflow for registry Co-exist: DVC for pipeline

🔹 DVC is:

🔹 MLflow is:

🔹 Workflow pattern in many teams

So you can think of DVC as a pre-production (research + reproducibility) tool, and MLflow as the production-facing registry/serving tool.

MinIO prep - dual network (docker compose)

MinIO Prep: Create Bucket mlflow-artifacts
After starting MinIO:

Since you're using MinIO (an S3-compatible object store) for MLflow artifacts, MLflow uses boto3 (AWS SDK for Python) under the hood to access and download models from S3 (MinIO).

To keep databases isolated (we also have 2 Postgres DBs), you want to:

Solution: Dual-Network MLflow

networks:
  airflow_net:
    external: true
  mlflow_net:
    driver: bridge

Put mlflow service on both networks:

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.11.1
    ...
    networks:
      - airflow_net
      - mlflow_net

MLFlow Model Registry (Docker)

Setup MLflow with MinIO (resembles S3) via Docker Compose

Feature Status
Metrics logging ✅ Done
Artifact logging ✅ Done
Model versioning ✅ Done
Rollback possible via UI ✅ Ready
Dockerized MLflow server ✅ Running
MLflow ↔ Airflow Integration 🔧 In progress (network fix pending)

Inference Pipeline

We create a FastAPI server to deploy our inference logic and to expose inference metrics, which is critical for monitoring model performance and creating active alerts. The FastAPI endpoint for single online inference applies the steps in the plan below.

Inference Pipeline Plan

  1. Integrate registry with inference to dynamically load latest or specific versions.
  2. Input Handling (Robustness Layer)
    • Accept inputs as JSON, CSV, or API payload.
    • Validate schema: column names, data types.
    • Handle missing values, unexpected categories, or out-of-range numerical values.
      ✨ Use pydantic for schema validation (popular in FastAPI).
  3. Preprocessing & Outlier Handling
    • Use preprocessor.transform(X) to apply exact same transformations as training.
    • Detect and optionally flag/remove outliers: Z-score or IQR for numeric.
    • Novel category handling for categorical (via handle_unknown='ignore').
  4. Prediction Logic
    • Apply pipeline to get online predictions + confidence scores.
    • Optionally support batch inference or streaming (FastAPI, Kafka, Airflow).
  5. Logging + Monitoring (Best Practice)
    • Log each request + response for auditability.
    • Track model drift, data drift, input distribution changes.
    • Send metrics to Prometheus.
  6. Error Handling + Alerts
    • Catch inference failures, malformed inputs.
    • Return meaningful error messages (JSON with error codes).
    • Integrate with alerting system (email, Slack, etc.) or for automatic healing (model rollback)
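Steps 2 and 6 of the plan can be sketched with pydantic, as suggested above. The field names follow the data schema table earlier; the Transaction model and parse_request helper are illustrative, not the project's actual schema.py:

```python
from pydantic import BaseModel, ValidationError

class Transaction(BaseModel):
    """Schema for a single inference request (fields from the data schema table)."""
    transaction_id: int
    amount: float
    transaction_time: float
    transaction_type: str
    location_region: str

def parse_request(payload):
    """Validate an incoming payload; return a structured error instead of a 500."""
    try:
        return Transaction(**payload)
    except ValidationError as e:
        return {"error": "invalid_input", "details": e.errors()}
```

In FastAPI, declaring Transaction as the request body type gives the same validation automatically; the explicit helper is useful for batch rows.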

Project Layout

project/
│
├── inference/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── metrics.py
│   │   ├── model_loader.py        # Load pipeline & metadata
│   │   ├── predict.py
│   │   ├── schema.py              # Input/Output Pydantic models
│   │   ├── utils.py               # Optional: input checks, outlier detection
│   │   └── inference.log          # Logs
│   ├── docker-compose-inference.yaml
│   ├── Dockerfile
│   ├── main.py                    # FastAPI app (main file)
│   └── requirements.txt
├── artifacts/
│   ├── preprocess/pipeline.pkl
│   └── models/model.pkl           # Saved full pipeline
│
└── version_meta.json

Inference Types

  1. Online Inference (Real-Time or Near Real-Time)
  2. Batch Inference (Offline or Async)

Robust Inference – Best Practices

Goal: prevent the model from making wild predictions on anomalous inputs.

To safeguard model predictions, we fit an IsolationForest as an outlier detector on the preprocessed training data and save it.

At inference, we first get its prediction on the input data. If it predicts "outlier", we do not ask the Fraud Model for a prediction.
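The gating logic just described can be sketched like this. It assumes a fitted preprocessor, a fitted IsolationForest, and a fraud classifier with predict_proba; the function name and return shape are our own illustration:

```python
def guarded_predict(x, preprocessor, outlier_detector, fraud_model):
    """Gate fraud predictions behind the IsolationForest outlier check."""
    X = preprocessor.transform(x)
    if outlier_detector.predict(X)[0] == -1:  # IsolationForest: -1 means outlier
        return {"status": "rejected", "reason": "outlier input"}
    proba = float(fraud_model.predict_proba(X)[0, 1])
    return {"status": "ok", "fraud_probability": proba}
```

Rejecting instead of predicting keeps the model from emitting a confident score on inputs far outside its training distribution.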

We also added:

In production systems, single vs. batch inference are often handled as separate endpoints for clarity, performance tuning, and scalability. Here's how it's treated in real systems:

Use Case API Endpoint Example Reason for Separation
Single Inference POST /predict Simple, low latency, immediate feedback
Batch Inference POST /predict/batch Vectorized operations, better throughput, async-friendly

Why Separate?

Full Next Steps Plan

  1. Monitoring & Metrics
    • Ensure your inference endpoint exposes metrics at /metrics (Prometheus format) if you haven’t done it yet.
    • Add logging for batch inference with success/failure and summary stats.
    • Consider implementing basic monitoring for model drift, input feature distribution, and latency.
    • Optionally add alerting hooks for anomalies in prediction or performance.
  2. Batch Inference + Airflow Integration
    • Make sure batch inference runs smoothly daily via Airflow DAG with logging.
    • Add error handling and retry logic.
    • Store batch output and metrics in versioned files.
    • Explore more advanced scheduling or event-driven triggering if needed.
  3. Inference Endpoint Robustness
    • Input validation and type checking for single and batch requests.
    • Outlier detection integration to handle edge cases gracefully.
    • Optionally include lightweight preprocessing checks (e.g., range checks, missing values).
    • Consider adding rate limiting or authentication if deploying publicly.
  4. Documentation & Readme
    • Write clear README summarizing:
    • Data flow pipeline (ETL, preprocessing, training, outlier, inference, batch, registry)
    • How to run locally, with Docker, and Airflow
    • How to add new data, retrain, and deploy updated models
    • How to monitor metrics and logs
    • Include example API requests and responses.

Monitoring

Monitoring and observability are critical parts of any system. Without them we do not know whether the system is healthy, or, if it is not, where the problem might be. Among the common observability tools is Prometheus, which collects metrics and can create alarms on them directly via its Alertmanager. We also connect Prometheus to a visualization tool such as Grafana to build dashboards with panels that show how the desired metrics change over time. Grafana also lets us create alerts on those dashboards. These alerts fire when their conditions are met and trigger defined actions such as sending notifications, rolling back models, and so on.

Complete ML System Monitoring

Scope What We Monitor
Model Performance Accuracy, F1, Drift, Outlier %, Fraud Rate
API Inference Server (FastAPI) Request count, latency, error rate, throughput
Batch Jobs (Airflow) Task duration, status (success/failure), retrain triggers
System Health CPU, RAM, Disk (Docker containers)

Model Performance Monitoring - Example

For example, to monitor model performance we can expose evaluation metrics via an endpoint (FastAPI, see ml_metrics/) so Prometheus can scrape /model-metrics. Then build a Grafana dashboard with panels to visualize them, and set threshold alerts to notify an admin or trigger other actions. I created a FastAPI server to store model metrics and expose them for Prometheus scraping.

Two Key Alerts in Grafana
Alert Name Trigger Condition
🚨 Accuracy Drop model_accuracy < 0.85
🕒 Slow Training training_duration_seconds > 2.0 seconds

All alerts can be logged in Redis (for history/audit), and actions can be attached to them, such as auto-retraining the model if the accuracy alert fires.
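The Redis audit trail can be sketched as follows. The key name and record format are illustrative, and the client is injected so any redis.Redis-compatible object works:

```python
import json
import time

def log_alert(client, alert_name, payload):
    """Append an alert record to a Redis list for history/audit."""
    record = {"alert": alert_name, "ts": time.time(), "payload": payload}
    client.rpush("alerts:history", json.dumps(record))
    return record
```

An auto-retrain hook would then read this list (or subscribe to the same events) and decide whether to trigger the training DAG.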

Provisioning resources manually is neither scalable nor auditable; in short, it is not best practice. We will do this with YAML/JSON files instead.

Monitoring Blueprint

  1. Expose Inference Server Metrics (via /metrics)
    • Used prometheus_fastapi_instrumentator to automatically expose /metrics and track latencies, counts, status codes, etc.
    • Used prometheus_client to define custom metrics: outliers count, fraud count, request count, average fraud score, inference latency using Histograms
  2. Expose Airflow Metrics (via Prometheus Exporter)
    • Airflow can emit metrics to Prometheus via statsd or prometheus-exporter
    • Monitor: DAG run duration, task failures, retries
  3. Container Health Metrics (Prometheus Node Exporter / cAdvisor)
    • Monitors Docker container resource usage
    • CPU %, memory, disk I/O, network
    • Scrape via Prometheus
    • Grafana: dashboards for resource bottlenecks
  4. Grafana Dashboards (Unified View)
    • Dashboard 1: Model Performance over time
    • Dashboard 2: API Inference Traffic (live)
    • Dashboard 3: Airflow Batch Job Monitoring
    • Dashboard 4: System Resources (cAdvisor)
  5. Optional: Alerting Rules
    • Slack/Email alerts if:
    • API error rate > 5%
    • Fraud rate spike
    • Batch job fails
    • CPU > 90% for 5min

Inference Monitoring

Metric Alert Strategy
Class distribution drift Alert if % fraud spikes
Confidence score drop Alert if low confidence common
Outlier count increase Alert on statistical outlier spike
Latency per request Real-time latency alert (for inference)

Instrument FastAPI

Add Prometheus instrumentation to FastAPI, which exposes the /metrics endpoint automatically. This is handled by instrumentator.expose(app), and Prometheus can now scrape it.

Library Purpose Recommendation
prometheus-client Low-level metric creation (Counters, Gauges) ✅ Use both
prometheus-fastapi-instrumentator Auto-instrument FastAPI (latency, error rate, etc.) ✅ Use for API metrics

Use prometheus-fastapi-instrumentator for automatic API monitoring,
AND use prometheus-client to define custom metrics like fraud_predictions_total.
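A minimal sketch of the custom-metrics side with prometheus-client (the metric names follow the text; the Histogram uses the library's default buckets):

```python
from prometheus_client import Counter, Histogram

# Custom business metrics, alongside the auto-instrumented API metrics
fraud_predictions_total = Counter(
    "fraud_predictions_total", "Predictions flagged as fraud")
inference_latency_seconds = Histogram(
    "inference_latency_seconds", "Model inference latency in seconds")

def record_prediction(is_fraud, latency_s):
    """Call once per request, after the model returns."""
    if is_fraud:
        fraud_predictions_total.inc()
    inference_latency_seconds.observe(latency_s)
```

Because both libraries register against the same default registry, these custom metrics appear on the same /metrics endpoint that the instrumentator exposes.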

Prometheus Alerts

Using YAML-based configuration gives you reproducibility, automation, and portability — key principles for any production-grade monitoring stack. Here’s how you can achieve full YAML-based alerting and monitoring:

Use Alertmanager for Alerts, YAML-Configured, because

How to Set Prometheus Alerts Up - Alertmanager
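An illustrative Prometheus rules-file fragment for the latency alert discussed in this note (the metric name inference_latency_seconds matches the custom histogram above; the threshold and durations are assumptions):

```yaml
groups:
  - name: inference
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(inference_latency_seconds_bucket[5m])) > 1.0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "p95 inference latency above 1s for 2 minutes"
```

Alertmanager then routes the firing alert to a receiver, in our case a webhook pointing at the FastAPI /alert endpoint.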

Alerts to Trigger an Action: Model Rollback via MLflow

We can add an Airflow task for auto rollback on model performance drop - Example:

This is where MLflow model versions + tags become useful. After setting up an alert such as "High Inference Latency", configure Alertmanager with a receiver for this alert, such as an API endpoint, using webhooks. In our case we used fastapi-hook to configure Alertmanager to send the high-inference-latency alert to our model server at http://inference-api:8000/alert, which handles the model rollback using MLflow and Airflow.

To test this, construct an intentionally slower version of the current production model by subclassing sklearn's LogisticRegression, and promote it to production using /dags/test_model_rollback.py. Once it receives traffic, the high latency fires the alert, which sends a POST request to the inference endpoint /alert. That endpoint runs an Airflow DAG that rolls back the model by demoting the running model from Production to Staging, and sends a signal back to the inference server's /rollback_model endpoint to reload the serving model, which automatically loads the previous Production model.

Automated Rollback Triggers:

Component Manual / Automated
MLflow Model Register Automated in train_fraud_detector_task.py
Rollback Decision Optional: manual OR automated
Model Rollback Automated via Airflow DAG
Inference Server Load Automated as it's using dynamic load
Alerts to Trigger Automated (Prometheus → Alertmanager → FastAPI → DAG)

How Models Get Loaded

Approach Description
Static load (file) Loads .pkl model file once. Cannot rollback automatically.
MLflow Registry Load Always loads "Production" model via MLflow URI.
Reload Endpoint Allows triggering reload manually or via webhook.
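The "Registry load + reload endpoint" combination can be sketched as a tiny module-level swap. The loader is injected so the sketch stays self-contained; in the real service it would be mlflow.sklearn.load_model, and the URI "models:/fraud_model/Production" is an illustrative registry URI:

```python
_current_model = None

def reload_model(loader, uri="models:/fraud_model/Production"):
    """Swap the served model in place; called at startup and from /rollback_model."""
    global _current_model
    _current_model = loader(uri)
    return _current_model

def predict(X):
    """Serve with whatever model is currently loaded."""
    if _current_model is None:
        raise RuntimeError("no model loaded yet")
    return _current_model.predict(X)
```

Because the URI always points at the Production stage, a rollback only has to change the registry and hit the reload endpoint; no redeploy is needed.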

Most Common Situations for Model Rollback :

Aspect Manual Rollback Automated Rollback
Trigger Human decision, often after monitoring. Automated metrics (latency, accuracy) trigger rollback.
Common In High-risk domains: finance, healthcare. Low-latency systems, e.g., recommender engines, e-commerce.
Tools Used MLflow UI, scripts, CI/CD tools Airflow, Kubernetes, Argo, Prometheus + Alertmanager
Typical Time to Rollback Minutes to hours. Seconds to minutes.

Mature MLOps setups rely on:

This is the critical decision point in production MLOps: When exactly should a model rollback be triggered automatically?

This URL is used in FastAPI to trigger a DAG in Airflow via its REST API.

AIRFLOW_TRIGGER_URL = "http://airflow-webserver:8080/api/v1/dags/model_rollback/dagRuns"

This URL uses the Docker service name airflow-webserver, so it resolves from other containers on the same network.
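The FastAPI side of the trigger might look like this. The URL comes from the text; the requests usage and the basic-auth credentials are assumptions (Airflow's docker compose ships a default API user, but yours may differ):

```python
import requests

AIRFLOW_TRIGGER_URL = "http://airflow-webserver:8080/api/v1/dags/model_rollback/dagRuns"

def trigger_rollback_dag(reason):
    """POST to Airflow's stable REST API to queue a model_rollback DAG run."""
    resp = requests.post(
        AIRFLOW_TRIGGER_URL,
        json={"conf": {"reason": reason}},
        auth=("airflow", "airflow"),  # assumption: default basic-auth credentials
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```

The conf payload is passed through to the DAG run, so the rollback DAG can log why it was triggered.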

Model Rollback Mechanism based on High Latency Inference

After a model is deployed into production, inference latency increases sharply for some time (say 2 minutes). The High Inference Latency alert (a Prometheus or Grafana alert; Prometheus in our case) fires and hits the FastAPI /alert endpoint, which in turn sends a POST request to an Airflow DAG to start rolling the model back to the previous stable version. This DAG finds the previous version, demotes the current version from Production to Staging, and sends the previous version back to a FastAPI endpoint /model_rollback to reload the previous model for inference.

The main pipeline is logged and loaded using mlflow.sklearn.log_model and mlflow.sklearn.load_model, which create a "sklearn flavor" model.

I had difficulty simulating a "delayed model" for testing this pipeline. The idea was to take the model in production, add some delay to its pipeline, and register it as a new version that gets promoted to production. I subclassed LogisticRegression as DelayedLogisticRegression to put a time.sleep in its prediction methods, and registered it.

import time

from sklearn.linear_model import LogisticRegression


class DelayedLogisticRegression(LogisticRegression):
    """LogisticRegression with artificial latency, for rollback testing."""

    def predict(self, X):
        time.sleep(5)  # simulate a slow model
        return super().predict(X)

    def predict_proba(self, X):
        time.sleep(5)
        return super().predict_proba(X)

At load time for inference, I got the error

ModuleNotFoundError: No module named 'unusual_prefix_83f8cee858e09b35f281415321530c3cdc750909_test_model_rollback'

When a custom model class is saved using cloudpickle, it stores the full module path. If your script/module structure has changed since the model was saved (e.g., a different filename, a renamed class, or the model was saved inside a notebook with an auto-generated module name), MLflow can't locate the exact same class to unpickle: cloudpickle expects the same module structure at load time. So I created a module in utils containing the subclass definition and made it available at load time on the same path used when logging and registering, so the import (from utils.delayed_model import DelayedLogisticRegression) works normally at loading. (There is no need to put this import line in the loading script, because it is used implicitly and internally during unpickling.) This was an elegant solution: it preserves the sklearn flavor and keeps things modular and clean, so the same mlflow.sklearn.log_model and mlflow.sklearn.load_model calls keep working for the custom delayed pipeline. I also put /utils in the PYTHONPATH environment variable (via ENV ... in the Dockerfile) so Python finds it when importing. The other option would be to use mlflow.pyfunc for logging and loading, which is a bit more involved.
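
The root cause is easy to reproduce with plain pickle: the serialized stream records the defining module and qualified class name, so that exact import path must exist again at load time. A minimal stdlib demonstration (the Demo class stands in for DelayedLogisticRegression):

```python
import pickle


class Demo:
    """Stands in for DelayedLogisticRegression."""


data = pickle.dumps(Demo())
# The byte stream embeds the defining module and class name; if either
# changes (renamed file, notebook-generated module name), unpickling
# raises ModuleNotFoundError / AttributeError.
assert b"Demo" in data
```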

We have just built a self-healing ML pipeline:

We’ve operationalized:

Grafana as Code

What we built:

Grafana provisions dashboards + alerts on startup from YAML/JSON files. Everything is configurable, reusable, and version-controlled.

/project/
└── monitoring/
    ├── grafana/
    │   ├── dashboards/
    │   │   ├── system_monitoring.json
    │   │   └── model_monitoring.json   # Prebuilt dashboard
    │   └── provisioning/
    │       ├── alerting/        # Grafana alert rules, provisioned at startup
    │       │   ├── alerting_rules.yaml  
    │       ├── dashboards/      # Links dashboards at startup
    │       │   ├── dashboards.yaml  
    │       ├── datasources/     # Data sources (e.g., Prometheus)
    │       │   └── datasource.yaml 
    ├── alertmanager/
    │   ├── alertmanager.yaml
    ├── prometheus/
    │   ├── alert_rules.yaml
    │   ├── prometheus.yaml
    ├── open_telemetry/
    │   ├── otel-config.yaml

Let's do a complete example: first configure Prometheus as a data source for Grafana using a YAML file such as grafana/provisioning/datasources/datasource.yaml. Then we can auto-provision a Grafana dashboard + an alert for high inference latency using only YAML:
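
A minimal datasource.yaml that provisions Prometheus on startup might look like this (the container hostname prometheus is an assumption based on the docker-compose service name):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```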

Different dashboards = different monitoring concerns:

| File | Purpose |
| --- | --- |
| model_monitoring.json | Dashboard with model-level / inference-monitoring metrics (e.g., fraud prediction count, outlier count, p95 inference latency, average inference latency). |
| system_monitoring.json | Dashboard focused on system-level metrics: CPU usage, memory usage, disk usage, etc. |

For the System Monitoring dashboard, we use node-exporter (which is per host machine, not per container) or Docker stats for container-level metrics. Add it to the same docker-compose as Prometheus for simplicity. Node Exporter gathers system metrics (CPU, memory, disk, network) and exposes them on http://localhost:9100/metrics from the host system itself; Prometheus scrapes http://node_exporter:9100/metrics. Prometheus (inside Docker) can reach it via host.docker.internal:9100 on Mac/Windows, or via the actual host IP on Linux. The mountpoint filter should be the root of your system; in my case /vscode.
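
The corresponding scrape job in prometheus.yaml could be sketched as follows (the service name node_exporter matches the docker-compose service mentioned above; swap in host.docker.internal:9100 on Mac/Windows):

```yaml
scrape_configs:
  - job_name: node
    scrape_interval: 15s
    static_configs:
      - targets: ["node_exporter:9100"]
```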

Logging and Tracing

Basic tracing is worth it, especially for a real-time ML inference pipeline. It is useful for:

[Your ML Service / API Container]
   ↓ sends metrics → Prometheus
   ↓ sends traces  → OpenTelemetry → Jaeger
   ↓ logs          → (stdout or ELK/other)

Use Jaeger (OpenTelemetry Backend) for tracing spans and full request paths:

Use opentelemetry-sdk with the OTLP exporter to send traces to the Collector. Open the Jaeger UI at localhost:16686, search for your service, and see spans, timing, and call paths. Create the tracer object in a module utils/tracing/tracing.py and import it where needed.

Instrument your FastAPI app

You can also add tracing spans and integrate them smoothly with your existing logging code:

  1. Set up the OpenTelemetry middleware for FastAPI.
    Instead of manually creating spans everywhere, use OpenTelemetry's automatic instrumentation middleware for FastAPI.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

# Instrument app to automatically create spans for incoming HTTP requests
FastAPIInstrumentor.instrument_app(app)

This will automatically trace every request, capturing latency, HTTP status, route, etc.

  2. Add manual spans inside important business logic.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.post("/predict")
def predict(...):
    with tracer.start_as_current_span("predict_handler"):
        # your prediction logic here
        ...

This creates a span named "predict_handler" that wraps your predict call and shows up in the Jaeger UI.

  3. Add trace context to logs, for easy correlation of your logs with traces, by adding the trace ID and span ID to log messages.
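
One stdlib-only way to do this is a logging.Filter that stamps every record with the current IDs. Here current_trace_ids() is a hypothetical stand-in: a real app would read the IDs from the active span via opentelemetry.trace.get_current_span().get_span_context():

```python
import logging


def current_trace_ids():
    # Hypothetical stand-in for the live OpenTelemetry span context
    # (trace_id is a 128-bit int, span_id a 64-bit int).
    return 0x5B8AA5A2D2C872E8321CF37308D69DF2, 0x051581BF3CB55C13


class TraceContextFilter(logging.Filter):
    """Inject hex-formatted trace_id/span_id into every log record."""

    def filter(self, record):
        trace_id, span_id = current_trace_ids()
        record.trace_id = format(trace_id, "032x")  # 32 hex chars
        record.span_id = format(span_id, "016x")    # 16 hex chars
        return True
```

Attach it with `logger.addFilter(TraceContextFilter())` and reference `%(trace_id)s` / `%(span_id)s` in the log format string.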

Useful Docker Commands:

docker-compose build --no-cache  # Build, ignoring the cache
docker stop $(docker ps -aq)  # Stop all containers
docker rm $(docker ps -aq)   # Remove all containers
docker volume ls  # Identify Airflow-related volumes
docker volume rm project_postgres-db-volume  # Replace with real names
docker network create -d bridge airflow_net  # Create a shared network
docker network rm airflow_net  # Delete the network
docker network inspect airflow_net  # See which services are inside the network